Review of allele and genotypic frequencies
Overview
We will learn how to compute allele and genotypic frequencies in R using the cattle data set.
Read a file
Use the function read.table to read the genotype file Geno.txt into a data frame. We will store the genotype data in the variable W.
W <- read.table(file = file.choose(), header = TRUE, stringsAsFactors = FALSE)
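Because file.choose() opens an interactive dialog, it may help to see the same call with an explicit path. The following is a minimal sketch on a hypothetical toy file written to a temporary location; the column names, animal IDs, and the 0/1/2 genotype coding are assumptions for illustration, not the contents of Geno.txt.

```r
# Hypothetical toy genotype file: an ID column followed by three SNP columns,
# with genotypes coded as 0, 1, or 2 (an assumption for illustration).
toy_file <- tempfile(fileext = ".txt")
writeLines(c("ID SNP1 SNP2 SNP3",
             "A001 0 1 2",
             "A002 1 1 0",
             "A003 2 0 1"), toy_file)

# Same call as above, with an explicit path in place of file.choose()
toy <- read.table(file = toy_file, header = TRUE, stringsAsFactors = FALSE)
str(toy)  # shows the type of each column
```

With header = TRUE the first line of the file is used for the column names, and stringsAsFactors = FALSE keeps the ID column as character strings.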
We can access an element of the data frame by giving its row and column coordinates inside the single square bracket operator []. Let’s first access the element in the first row and the first column. When the row coordinate is omitted, the operator returns the entire column; for a single column, R simplifies the result to a vector unless drop = FALSE is specified.
W[1, 1] # 1st row and 1st column
head(W[, 1]) # first six values of the 1st column, returned as a vector
The following code shows the first five rows and columns.
W[1:5, 1:5]
We then drop the first column of the data frame, which contains the animal IDs. A negative index excludes the corresponding column, so -1 means dropping the first column.
W <- W[, -1]
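The indexing rules used so far can be checked on a small example. This is a sketch on a hypothetical toy data frame (the names toy, SNP1, and SNP2 are made up), not on the cattle data.

```r
# Hypothetical toy data frame (not the cattle data)
toy <- data.frame(ID = c("A001", "A002", "A003"),
                  SNP1 = c(0, 1, 2),
                  SNP2 = c(1, 1, 0),
                  stringsAsFactors = FALSE)

toy[1, 1]                              # single element in row 1, column 1
is.vector(toy[, 1])                    # a single column is simplified to a vector
is.data.frame(toy[, 1, drop = FALSE])  # drop = FALSE keeps a one-column data frame
names(toy[, -1])                       # column names left after dropping the ID column
```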
Next, we will convert W from a data frame into a matrix. In R, a numeric matrix is generally more memory efficient than a data frame and is more convenient for linear algebra.
W <- as.matrix(W)
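Dropping the ID column before the conversion matters, because a matrix can hold only one data type. The sketch below uses a hypothetical toy data frame to show what as.matrix returns with and without the ID column.

```r
# Hypothetical toy data frame with a character ID column
toy <- data.frame(ID = c("A001", "A002"),
                  SNP1 = c(0, 1),
                  SNP2 = c(2, 1),
                  stringsAsFactors = FALSE)

M_all <- as.matrix(toy)        # ID column included
M_snp <- as.matrix(toy[, -1])  # ID column dropped first

mode(M_all)  # every entry is stored as a character string
mode(M_snp)  # entries stay numeric, so sums and products work
```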
What is the dimension of W?
dim(W)
Allele frequency
Let’s compute the allele frequency of the first SNP. The table function returns the number of animals carrying each genotype.
table(W[, 1])
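We can see that there are 100 AA animals, 475 Aa animals, and 429 aa animals. Let’s assign these numbers to variables. The table function orders its entries by increasing genotype code, so with genotypes coded as the number of copies of A, the first entry counts aa animals and the third counts AA animals.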
nAA <- table(W[, 1])[3]
nAa <- table(W[, 1])[2]
naa <- table(W[, 1])[1]
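Indexing the table by position relies on all three genotypes being present at the SNP. A more defensive sketch, shown here on a hypothetical genotype vector and assuming genotypes are coded as the number of copies of A, fixes the levels before counting so that empty genotype classes are kept as zero.

```r
# Hypothetical genotype vector coded as copies of A; no animal is AA (code 2)
g <- c(0, 1, 1, 0, 1, 0)

table(g)  # only two entries, so table(g)[3] would be NA

# Fixing the levels keeps all three genotype classes, including empty ones
counts <- table(factor(g, levels = 0:2))
naa_toy <- counts[["0"]]
nAa_toy <- counts[["1"]]
nAA_toy <- counts[["2"]]
c(nAA = nAA_toy, nAa = nAa_toy, naa = naa_toy)
```

Indexing by the genotype label rather than by position also makes the code independent of the order in which table lists its entries.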
Allele frequency of A is given by \[ f(A) = p = \frac{2 \times (\text{no. of } AA \text{ individuals}) + 1 \times (\text{no. of } Aa \text{ individuals})}{2 \times \text{total no. of individuals}}. \]
Exercise 1
Use the variables nAA, nAa, and naa defined above and compute the allele frequencies of A and a in the first SNP.
Genotypic frequency
Genotypic frequency is given by \[ \begin{aligned} f(AA) &= P = \frac{\text{no. of } AA \text{ individuals}}{\text{total no. of individuals}}, \\ f(Aa) &= H = \frac{\text{no. of } Aa \text{ individuals}}{\text{total no. of individuals}}, \\ f(aa) &= Q = \frac{\text{no. of } aa \text{ individuals}}{\text{total no. of individuals}}. \end{aligned} \]
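As a small worked example of these formulas, the sketch below uses hypothetical genotype counts (made-up numbers, not the cattle data) and divides each count by the total number of individuals.

```r
# Hypothetical genotype counts (not from the cattle data)
counts <- c(AA = 30, Aa = 50, aa = 20)

# Genotypic frequencies: each count divided by the total number of individuals
geno_freq <- counts / sum(counts)
geno_freq

sum(geno_freq)  # the three genotypic frequencies add up to 1
```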
Exercise 2
What are the genotypic frequencies of AA, Aa, and aa in the first SNP?
Another approach for obtaining allele frequency
\[ f(A) = p = \frac{2 \times (\text{frequency of } AA) + 1 \times (\text{frequency of } Aa)}{2 \times (\text{frequency of } AA + Aa + aa)}. \]
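Because the three genotypic frequencies sum to one, the expression above simplifies. Writing the genotypic frequencies as \(P\), \(H\), and \(Q\), as in the previous section, \[ p = \frac{2P + H}{2(P + H + Q)} = P + \frac{1}{2}H, \qquad q = 1 - p = Q + \frac{1}{2}H. \]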
Exercise 3
Use the variables P, H, and Q defined above and compute the allele frequencies of A and a in the first SNP.
Exercise 4
What are the genotypic frequencies of AA, Aa, and aa in the second SNP?
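nAA <- table(W[, 2])[3] # no. of AA animals in the second SNP
nAa <- table(W[, 2])[2] # no. of Aa animals in the second SNP
naa <- table(W[, 2])[1] # no. of aa animals in the second SNP
p <- (2 * nAA + 1 * nAa)/(2 * (nAA + nAa + naa)) # allele frequency of A
p
q <- 1 - p # allele frequency of a
q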
Compute allele frequencies for all SNPs
So far we have learned how to compute the allele frequency of a single SNP using the table function. Here, we consider how to compute allele frequencies for all SNPs. Of course, we could apply the table function manually to one SNP at a time. However, this approach would take too much time for 6,960 SNPs. Recall that the allele frequency of A is given by \[
f(A) = p = \frac{2 \times (\text{no. of } AA \text{ individuals}) + 1 \times (\text{no. of } Aa \text{ individuals})}{2 \times \text{total no. of individuals}}.
\] We can rewrite this equation as \[
f(A) = p = \frac{(\text{no. of } A \text{ allele in the population})}{2 \times \text{total no. of individuals}}.
\] This suggests that all we need is the number of copies of the reference allele \(A\) for each SNP. Because each genotype is coded as the number of copies of \(A\) that the animal carries, the sum function returns the number of \(A\) alleles at a SNP.
sum(W[, 1]) # number of A alleles in the first SNP
sum(W[, 2]) # number of A alleles in the second SNP
How can we repeat this operation for all 6,960 SNPs? The colSums function returns the sum of each column of a matrix as a vector.
colSums(W)
Note that colSums(W) gives the numerator of the above equation for every SNP. We then divide it by \(2 \times \text{total no. of individuals}\). The function nrow returns the number of rows, which here is the number of animals.
p <- colSums(W)/(2 * nrow(W))
The variable p is a vector that contains the frequencies of the reference allele for the 6,960 SNPs.
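To see that the column-sum approach agrees with the count-based formula, here is a minimal sketch on a hypothetical 4-animal by 3-SNP matrix, assuming genotypes are coded as the number of copies of A. The last lines show one possible way to handle missing genotypes, which the computation above does not address.

```r
# Hypothetical genotype matrix; matrix() fills column by column,
# so each line of numbers below is one SNP (column)
toy <- matrix(c(0, 1, 2, 1,
                2, 2, 1, 0,
                0, 0, 1, 0),
              nrow = 4,
              dimnames = list(NULL, c("SNP1", "SNP2", "SNP3")))

# Column-sum approach for all SNPs at once
p_toy <- colSums(toy) / (2 * nrow(toy))
p_toy

# Count-based approach for SNP1 only, for comparison
cnt <- table(factor(toy[, "SNP1"], levels = 0:2))
p_snp1 <- (2 * cnt[["2"]] + cnt[["1"]]) / (2 * sum(cnt))
p_snp1

# With missing genotypes (NA), averaging over the observed animals is one option
toy_na <- toy
toy_na[1, "SNP2"] <- NA
p_na <- colMeans(toy_na, na.rm = TRUE) / 2
p_na
```

Dividing the column mean by two is equivalent to dividing the column sum by twice the number of animals with an observed genotype at that SNP.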
Exercise 5
What is the frequency of the reference allele at the 300th SNP?
Exercise 6
What is the mean of the reference allele frequencies in this population?
Minor allele frequency
In most cases, people report the minor allele frequency, which is the frequency of the less frequent allele at a given SNP. We can convert allele frequencies into minor allele frequencies by using the ifelse function.
maf <- ifelse(p > 0.5, 1 - p, p)
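The conversion can be checked on a few hypothetical allele frequencies. The sketch below also shows pmin as an equivalent alternative to ifelse; the values and SNP names are made up for illustration.

```r
# Hypothetical reference allele frequencies for three SNPs
p_toy <- c(SNP1 = 0.10, SNP2 = 0.50, SNP3 = 0.85)

# Same rule as above: take 1 - p whenever p exceeds 0.5
maf_ifelse <- ifelse(p_toy > 0.5, 1 - p_toy, p_toy)

# Equivalent formulation: the smaller of p and 1 - p, element by element
maf_pmin <- pmin(p_toy, 1 - p_toy)

maf_ifelse
all.equal(maf_ifelse, maf_pmin)  # both approaches agree
```

By construction, a minor allele frequency never exceeds 0.5.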
Exercise 7
What is the minor allele frequency at the 300th SNP?
Exercise 8
What is the mean of the minor allele frequencies?
Visualization
Now let’s visualize the minor allele frequencies for the first 500 SNPs.
plot(maf[1:500])
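Axis labels and a histogram can make the picture easier to read. The sketch below uses simulated values (maf_sim is a hypothetical stand-in drawn from a uniform distribution, not the cattle data) so that it runs on its own; replacing maf_sim with maf[1:500] would apply it to the real data.

```r
# Simulated minor allele frequencies, a stand-in for maf[1:500]
set.seed(1)
maf_sim <- runif(500, min = 0, max = 0.5)

# Scatter plot against SNP index, with labeled axes
plot(maf_sim, xlab = "SNP index", ylab = "Minor allele frequency",
     pch = 16, cex = 0.6)

# Histogram of the same values in bins of width 0.05
h <- hist(maf_sim, breaks = seq(0, 0.5, by = 0.05),
          main = "Distribution of minor allele frequencies",
          xlab = "Minor allele frequency")
```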
Save R objects
Save the variable W so that we can reuse it in the next class.
save(W, file = "W.Rda")
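A saved object is restored with load, which recreates it under its original name. The following round-trip sketch uses a hypothetical toy matrix and a temporary file so that it does not overwrite W.Rda.

```r
# Hypothetical toy matrix saved to a temporary file
W_toy <- matrix(c(0, 1, 2, 1), nrow = 2)
rda_file <- tempfile(fileext = ".Rda")
save(W_toy, file = rda_file)

# Remove the object, then restore it from the file
rm(W_toy)
loaded <- load(rda_file)  # load() returns the names of the restored objects
loaded
dim(W_toy)
```

In the next class, load("W.Rda") would bring back W in the same way, provided the working directory contains the file.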