Interactive deterministic formulas for genomic prediction

class: center, middle, inverse, title-slide

# Interactive deterministic formulas for genomic prediction
## ASCI431-831: Advanced Animal Breeding
### Gota Morota <br /><br /> <a href="http://morotalab.org/" class="uri">http://morotalab.org/</a>
### 2018/04/12

---

# Shiny - [https://shiny.rstudio.com/](https://shiny.rstudio.com/)

- A web application framework for **interactive** visualization

- Able to generate user friendly web interfaces

- Built on a reactive programming model

- Entirely extensible 
   - custom inputs and outputs
   - CSS themes
   - JavaScript and D3.js

- Example - [Collision Detection](https://bl.ocks.org/mbostock/raw/3231298/)

---

# Shiny framework 
<img src="Shinyframework.png" height="300px" width="650px"/>

**Template**

```r
library(shiny)

ui <- fluidPage()
server <- function(input, output) {}

shinyApp(ui = ui, server = server)
```

---

# Control widgets 
<img src="widgets.png" width=700 height=470>
.left[[RStudio](https://shiny.rstudio.com/tutorial/written-tutorial/lesson3/)]
.right[[Example](https://github.com/morota/asci431-2018/blob/gh-pages/day28/ex.R)]

---
# Cross-validation for genomic prediction
<div align="center">
<img src="Fig1CV.png" width=650 height=450>
</div>
.right[[doi:10.1093/jas/sky014](http://dx.doi.org/10.1093/jas/sky014)]

---

# Deterministic genomic prediction formulas

-  highlight the relationships among prediction accuracy and potential factors influencing prediction accuracy

- no computationally intensive cross-validation

- prior to genotyping individuals

---
class: inverse, left, middle

# ShinyGPAS - Shiny Genomic Prediction Accuracy Simulator

Can be used for 
- _interactive_ exploration of potential factors influencing prediction accuracy

- simulation of achievable prediction accuracy 
   - prior to genotyping individuals or performing CV
   
- supporting in-class teaching

- no knowlege of R, HTML, JavaScript, or CSS is required. R code encapsulated as a web-based Shiny application

Available at [https://chikudaisei.shinyapps.io/shinygpas/](https://chikudaisei.shinyapps.io/shinygpas/) and [https://github.com/morota/ShinyGPAS](https://github.com/morota/ShinyGPAS)

---

# Deterministic formula (1)
Daetwyler et al. (2008; 2010)
`\begin{align}
r &= \sqrt{\frac{N h^2}{N h^2 + M_e} }
\end{align}`
where `\(N\)` is the number of individuals in the reference population, `\(h^2\)` is the heritability, and `\(M_e\)` is the number of independent chromosome segments.

- treating genetic markers as fixed

- independence of quantitative trait loci (QTL)

- regression of phenotypes on genotype one locus at a time

- `\(\sigma^2_{\epsilon}\)` and `\(\sigma^2_g + \sigma^2_{\epsilon}=1\)`

- identical accuracy of QTL effect size estimates across QTL allele frequencies

- perfect linkage disequilibrium (LD) between marker and QTL pairs

---

# Deterministic formula (2)
Goddard (2009)
`\begin{align}
r &= \sqrt{1 - \frac{\lambda}{2N\sqrt{\alpha}} \ln\left( \frac{1 + \alpha + 2\sqrt{\alpha}}{1 + \alpha - 2\sqrt{\alpha}}\right) }
\end{align}`
where `\(\lambda\)` is `\(M_e/(h^2\ln(2N_e))\)` and `\(\alpha\)` is `\(1 + 2(M_e/Nh^2\ln(2N_e))\)`

- treating markers as random

- assuming complete LD between marker and QTL pairs

- QTL effects were assumed to be sampled from a normal distribution

- assumes that QTL with high minor allele frequencies have more accurate effect size than QTL with low minor allele frequencies

---

# Deterministic formula (3)
Goddard et al. (2011)
`\begin{align}
r &= \sqrt{b \frac{Nbh^2/M_e}{1 + Nbh^2/M_e}}
\end{align}`
where `\(b = M/(M + M_e)\)`  is the proportion of genetic variance explained by the markers and `\(M\)` is the is the number of markers.

- accounts for incomplete LD between markers and QTL

---

# Deterministic formula (4)
Rabier et al. (2016)
`\begin{align*}
r &= \sqrt{\frac{h^2/(1-h^2)}{M_e/N + h^2/(1-h^2)}}.
\end{align*}`

- relaxing the assumption of `\(\sigma^2_{\epsilon}\)` and `\(\sigma^2_g + \sigma^2_{\epsilon}=1\)`

- can be applied with any value of `\(\sigma^2_{\epsilon}\)` and `\(\sigma^2_g\)`

---

# Deterministic formula (5)
Rabier et al. (2016)
`\begin{align*}
r &= \sqrt{\frac{h^2/(1-h^2)}{\mathbb{E}(||\mathbf{x}'_{n_{\text{TRN} + 1}} \mathbf{X}'\mathbf{V}^{-1} ||^2)  + h^2/(1-h^2)}}
\end{align*}`
`\(M_e/N\)` is equal to `\(\mathbb{E}(||\mathbf{x}'_{n_{\text{TRN} + 1}} \mathbf{X}'\mathbf{V}^{-1} ||^2)\)` under RRBLUP.

- `\(\mathbf{x}'_{n_{\text{TRN} + 1}}\)` is the vector of markers belonging to the testing set individual
- `\(\mathbf{X}\)` is the training set allele content matrix 
- `\(\mathbf{V} = \mathbf{X}\mathbf{X}' + \lambda \mathbf{I}\)` 
- `\(\lambda\)` is the regularization parameter 
- `\(||.||^2\)` is the squared norm

Note that if we can assume individuals in training and testing sets were sampled from the same population, 
`\(\hat{\mathbb{E}}(||\mathbf{x}'_{n_{\text{TRN} + 1}} \mathbf{X}'\mathbf{V}^{-1} ||^2) \le 1\)`

---

# Deterministic formula (6)
de los Campos et al. (2013)

Under the genomic best linear unbiased prediction framework
`\begin{align}
r &= \sqrt{ [1 - (1 - b)^2] h^2 }
\end{align}`

- assuming that the patterns of allele sharing between markers and causal loci are different

- `\(b\)` is the average regression coefficient of the marker-based genomic relationships on genomic relationships at QTL

---

# Deterministic formula (7)
- Karaman et al. (2016)
`\begin{align}
r &=  \sqrt{ h^2_M \left[ \frac{N h^2_M}{N h^2_M + M_e}  \right] }
\end{align}`

- correlation between phenotypes and estimated breeding values (additive genetic values)

- `\(h^2_M\)` is the genomic heritability, which is the proportion of phenotypic variance that is explained by regression on markers.

---

# Deterministic formula (8)

- Wientjes et al. (2016)
`\begin{align}
r = \sqrt{ \begin{bmatrix} b_{AC} r_{G_{AC}} \sqrt{\frac{h^2_A}{M_{e_{A,C}}}  } & b_{BC} r_{G_{BC}} \sqrt{\frac{h^2_B}{M_{e_{B,C}}}} \end{bmatrix} \begin{bmatrix} \frac{h^2_A}{M_{e_{A,C}}} + \frac{1}{N_A} & r_{G_{AB}} \sqrt{\frac{h^2_A h^2_B}{M_{e_{A,C}} M_{e_{B,C}} }   } \\ r_{G_{AB}} \sqrt{\frac{h^2_A h^2_B}{M_{e_{A,C}} M_{e_{B,C}} }   } &  \frac{h^2_B}{M_{e_{B,C}}} + \frac{1}{N_B} \end{bmatrix}^{-1} } \\
\times \sqrt{\begin{bmatrix} b_{AC} r_{G_{AC}} \sqrt{\frac{h^2_A}{M_{e_{A,C}}}} \\ b_{BC} r_{G_{BC}} \sqrt{\frac{h^2_B}{M_{e_{B,C}}}} \end{bmatrix}}
\end{align}`

Combines two populations A and B to predict prediction accuracy in population C.

- `\(b_{AC}\)` is the square root of the proportion of the genetic variance in predicted population C explained by the markers in training population A
- `\(r_{G_{AC}}\)` is the genetic correlation between populations A and C
- `\(M_{e_{A,C}}\)` is the effective number of chromosome segments shared between populations A and C

---

# Paper
<img src="GSE.png" height="420px" width="710px"/>

[doi:10.1186/s12711-017-0368-4](https://doi.org/10.1186/s12711-017-0368-4)