Genome-wide allele frequencies

Overview

We will learn how to compute 1) genome-wide allele frequencies and 2) expectation and variance of allelic counts.

Read a file

Use the load function to reload the SNP matrix W from the file W.Rda, which we saved as an R object in the previous class.

load(file = file.choose())
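As a small sketch of what load does, the example below saves a toy matrix to a temporary file and reloads it. The object W_toy and the temporary file are made up for illustration and are not part of the course data. The point is that load restores the object under the name it had when it was saved, and returns that name as a character vector.

```r
# Hypothetical toy matrix standing in for the real SNP matrix
W_toy <- matrix(c(0, 1, 2, 1), nrow = 2)

# Save it to a temporary .Rda file, then remove it from the workspace
tmp <- tempfile(fileext = ".Rda")
save(W_toy, file = tmp)
rm(W_toy)

# load() restores the object under its original name and
# (invisibly) returns the names of the restored objects
loaded <- load(file = tmp)
loaded
dim(W_toy)
```

After loading the real file, calling dim(W) in the same way is a quick check that the matrix has the expected number of individuals (rows) and SNPs (columns).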

Compute allele frequencies for all SNPs

We have learned how to compute the allele frequency of the first SNP using the table function. Here, we consider how to compute allele frequencies for all SNPs. We could apply the table function to one SNP at a time, but this would take far too long for 5,000 SNPs. Recall that the frequency of allele \(A\) is given by \[ f(A) = p = \frac{2 \times (\text{no. of } AA \text{ individuals}) + 1 \times (\text{no. of } Aa \text{ individuals})}{2 \times \text{total no. of individuals}}. \] The numerator simply counts the copies of \(A\) in the population, so we can rewrite this equation as \[ f(A) = p = \frac{\text{no. of } A \text{ alleles in the population}}{2 \times \text{total no. of individuals}}. \] This suggests that all we need is the number of copies of the reference allele \(A\) for each SNP. Because each entry of W is the number of copies of \(A\) carried by an individual (0, 1, or 2), the sum function applied to a column of W returns this count.

sum(W[, 1])  # number of A alleles at the first SNP
sum(W[, 2])  # number of A alleles at the second SNP
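To see that the two formulas agree, here is a minimal sketch on a toy matrix. W_toy is hypothetical data, and we assume the same coding as W, namely the number of copies of \(A\) (0 = aa, 1 = Aa, 2 = AA).

```r
# Hypothetical toy genotype matrix: 5 individuals x 2 SNPs,
# coded as the number of copies of the A allele
W_toy <- matrix(c(2, 1, 0, 1, 2,
                  0, 0, 1, 2, 1),
                nrow = 5, ncol = 2)

# Genotype counts for the first SNP; factor() keeps all three
# genotype classes even if one of them is absent
geno <- table(factor(W_toy[, 1], levels = 0:2))
n_Aa <- geno[["1"]]
n_AA <- geno[["2"]]

# Frequency from genotype counts (first formula)
p_table <- (2 * n_AA + 1 * n_Aa) / (2 * nrow(W_toy))

# Frequency from the total allele count (second formula)
p_sum <- sum(W_toy[, 1]) / (2 * nrow(W_toy))

p_table
p_sum
```

Both values are 0.6 for this toy SNP: there are two AA and two Aa individuals, which gives 6 copies of \(A\) out of 10 alleles.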

How do we repeat this operation for all 5,000 SNPs? The colSums function returns the sum of each column of a matrix as a vector.

colSums(W)

Note that colSums(W) gives the numerator of the above equation. We then divide this number by \(2 \times \text{total no. of individuals}\).

p <- colSums(W)/(2 * nrow(W))
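The formula above assumes that every genotype is observed. Whether W contains missing values is not stated here, so the following is only a sketch of one way to handle them, using a hypothetical toy matrix with a single NA. The idea is to count observed genotypes only, in both the numerator and the denominator.

```r
# Hypothetical toy matrix: 4 individuals x 2 SNPs, one missing genotype
W_toy <- matrix(c(2, 1, 0, NA,
                  0, 1, 1, 2),
                nrow = 4, ncol = 2)

# Complete-data formula: gives NA for any SNP with a missing genotype
p_naive <- colSums(W_toy) / (2 * nrow(W_toy))

# Missing-aware version: use only the observed genotypes
n_obs <- colSums(!is.na(W_toy))                      # observed individuals per SNP
p_obs <- colSums(W_toy, na.rm = TRUE) / (2 * n_obs)

p_naive
p_obs
```

When no genotypes are missing, n_obs equals nrow(W_toy) for every SNP and the two versions give the same result.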

The variable p is a vector containing the reference allele frequencies of the 5,000 SNPs.

Exercise 1

What is the frequency of the reference allele at the 400th SNP?

Exercise 2

What is the mean of the reference allele frequencies in this population?

Minor allele frequency

In most cases, people report the minor allele frequency (MAF), which is the frequency of the less common allele at a given SNP. We can convert allele frequencies into minor allele frequencies with the ifelse function.

maf <- ifelse(p > 0.5, 1 - p, p)
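A small sketch of what this conversion does, using a hypothetical vector of allele frequencies (p_toy is made up for illustration). The minor allele frequency is the smaller of p and 1 - p, so the element-wise minimum function pmin gives the same result as ifelse.

```r
# Hypothetical reference allele frequencies for four SNPs
p_toy <- c(0.10, 0.50, 0.85, 0.62)

# Same conversion as in the text
maf_ifelse <- ifelse(p_toy > 0.5, 1 - p_toy, p_toy)

# Equivalent formulation: element-wise minimum of p and 1 - p
maf_pmin <- pmin(p_toy, 1 - p_toy)

maf_ifelse
maf_pmin
```

Either way, a minor allele frequency can never exceed 0.5.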

Exercise 3

What is the minor allele frequency at the 400th SNP?

Exercise 4

What is the mean of the minor allele frequencies?

Now let’s visualize the minor allele frequencies for the first 500 SNPs.

plot(maf[1:500])
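The plot can be made easier to read with axis labels, and a histogram is another common way to look at the distribution of minor allele frequencies. The sketch below uses simulated frequencies (p_sim and maf_sim are hypothetical stand-ins, not the course data) so that it runs on its own.

```r
# Simulated allele frequencies for 500 hypothetical SNPs
set.seed(1)
p_sim <- runif(500)
maf_sim <- ifelse(p_sim > 0.5, 1 - p_sim, p_sim)

# Scatter plot of MAF against SNP index, with axis labels
plot(maf_sim,
     xlab = "SNP index",
     ylab = "Minor allele frequency",
     ylim = c(0, 0.5))

# Histogram of the same values
hist(maf_sim,
     breaks = 20,
     xlab = "Minor allele frequency",
     main = "Distribution of minor allele frequencies")
```

Replacing maf_sim with maf[1:500] applies the same plots to the real data.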

Expectation and variance of allelic counts

Exercise 5

Recall that the expectation of the genotype, \(E(W)\), is given by \(2p\), where \(p\) is the frequency of the reference allele. Verify that \(2p\) is equal to the mean genotype of each SNP obtained from the colMeans() function. Use the variables W and p.

Exercise 6

Recall that the variance of the genotype, \(Var(W)\), is given by \(2p(1-p)\). Verify that \(2p(1-p)\) is close to the sample variance of each SNP's genotypes obtained from the var() function. Use the variables W and p.
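As a reminder of where these two formulas come from, here is a short derivation. It assumes Hardy-Weinberg equilibrium, so that the genotype probabilities are \(P(W = 2) = p^2\), \(P(W = 1) = 2p(1-p)\), and \(P(W = 0) = (1-p)^2\). \[ E(W) = 2 \cdot p^2 + 1 \cdot 2p(1-p) + 0 \cdot (1-p)^2 = 2p \] \[ E(W^2) = 4 \cdot p^2 + 1 \cdot 2p(1-p) = 4p^2 + 2p(1-p) \] \[ Var(W) = E(W^2) - [E(W)]^2 = 4p^2 + 2p(1-p) - 4p^2 = 2p(1-p) \] The sample variance is only expected to be close to \(2p(1-p)\), not equal to it, for two reasons: observed genotype frequencies may deviate from Hardy-Weinberg proportions, and the var() function divides by \(n - 1\) rather than \(n\).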

Gota Morota

January 26, 2017