## Data

We are going to use the mice data included in the BGLR package.

library(BGLR)
data(mice)
W <- mice.X[, 1:100]

We will learn how to compute allele and genotypic frequencies in R using the cattle data set.

Let’s compute the allele frequency of the first SNP. The table function returns frequncies of genotypes.

table(W[, 1])

We can see that there are 541 AA animals, 929 Aa animals, and 344 aa animals. Let’s assign these numbers into variables.

nAA <- table(W[, 1])[3]
nAa <- table(W[, 1])[2]
naa <- table(W[, 1])[1]

Allele frequency of A is given by $f(A) = p = \frac{2 \times (\text{no. of } AA \text{ individuals}) + 1 \times (\text{no. of } Aa \text{ individuals})}{2 \times \text{total no. of individuals}}.$

### Exercise 1

Use the variables nAA, nAa, and naa defined above and compute the allele frequencies of A and a in the first SNP.

p <- (2 * nAA + 1 * nAa)/(2 * (nAA + nAa + naa))
p
q <- 1 - p
q

## Genotypic frequency

Genotypic frequency is given by $f(AA) = P = \frac{\text{No. of } AA \text{ individuals}}{\text{Total no. individuals}} \\ f(Aa) = H = \frac{\text{No. of } Aa \text{ individuals}}{\text{Total no. individuals}} \\ f(aa) = Q = \frac{\text{No. of } aa \text{ individuals}}{\text{Total no. individuals}}. \\$

### Exercise 2

What are the genotypic frequencies of AA, Aa, and aa in the first SNP?

P <- (nAA)/(nAA + nAa + naa)
P
H <- (nAa)/(nAA + nAa + naa)
H
Q <- (naa)/(nAA + nAa + naa)
Q

## Another approach for obtaining allele frequency

$f(A) = p = \frac{2 \times (\text{frequency of } AA) + 1 \times (\text{frequency of } Aa)}{2 \times (\text{frequency of } AA + Aa + aa)}.$

### Exercise 3

Use the variables P, H, and Q defined above and compute the allele frequencies of A and a in the first SNP.

p.ex3 <- (2 * P + 1 * H)/(2 * (P + H + Q))
p.ex3
q.ex3 <- 1 - p.ex3
q.ex3

### Exercise 4

What are the genotypic frequencies of AA, Aa, and aa in the second SNP?

nAA <- table(W[, 2])[3]
nAa <- table(W[, 2])[2]
naa <- table(W[, 2])[1]

p <- (2 * nAA + 1 * nAa)/(2 * (nAA + nAa + naa))
p
q <- 1 - p
q

P <- (nAA)/(nAA + nAa + naa)
P
H <- (nAa)/(nAA + nAa + naa)
H
Q <- (naa)/(nAA + nAa + naa)
Q

## Compute allele frequencies for all SNPs

So far we have learned how to compute the allele frequency of a single SNP using the table function. Here, we consider how to compute allele frequencies for the entire SNPs. Of course we can apply the table function manually one at a time. However, this approach takes too much time to compute allele frequencies for 6,960 SNPs. Recall that allele frequency of A is given by $f(A) = p = \frac{2 \times (\text{no. of } AA \text{ individuals}) + 1 \times (\text{no. of } Aa \text{ individuals})}{2 \times \text{total no. of individuals}}.$ We can rewrite this equation into $f(A) = p = \frac{(\text{no. of } A \text{ allele in the population})}{2 \times \text{total no. of individuals}}.$ This suggests that all we need is the number of $$A$$ allele or reference allele $$a$$ for each SNP. The sum function returns the number of reference allele $$A$$.

sum(W[, 1])  # sum of A allele in the first SNP
sum(W[, 2])  # sum of A allele in the second SNP

How to repeat this operation for 300 SNPs? The colSums function returns the sum of each column in a matrix as a vector.

colSums(W)

Note that colSums(W) gives the numerator of the above equation. We then divide this number by $$2 \times \text{total no. of individuals}$$. The function nrows returns the number of rows.

p <- colSums(W)/(2 * nrow(W))

The variable p is a vector and it contains the allele frequencies of reference allele for 6,960 SNPs.

### Exercise 5

What is the allele frequency of reference allele in the 50th SNP?

p[50]

## Minor allele frequency

In most cases, people report a minor allele frequency, which is the frequency of less frequent allele in a given SNP. We can convert allele frequencies into minor allele frquencies by using the ifelse function.

maf <- ifelse(p > 0.5, 1 - p, p)

### Exercise 6

What is the minor allele frquency of reference allele in the 30th SNP?

maf[30]

### Exercise 7

What is the mean of minor allele frquencies?

mean(maf)

## Expectations and variances

### Exercise 8

Compute the allele frequency of SNP markers. Recall that the expectation of genotype, $$E(W)$$, is given by $$2p$$, where $$p$$ is the frequency of reference allele. Verify that $$2p$$ is equal to the mean of each genotype obtained from the colMeans() function.

p3 <- (2 * colSums(W == 2) + 1 * colSums(W == 1))/(2 * nrow(W))
table(colMeans(W) == 2 * p3)

p <- colSums(W)/(2 * nrow(W))
table(colMeans(W) == 2 * p)

p2 <- colMeans(W)/2
table(p == p2)

### Exercise 9

Recall that the variance of genotype, $$Var(W)$$, is given by $$2p(1-p)$$. Verify that $$2p(1-p)$$ is close to the variance of each genotype obtained from the var() function.

var1 <- 2 * p * (1 - p)
var2 <- diag(var(W))
var2 <- apply(W, 2, var)
table(var1 == var2)
all.equal(var1, var2, tol = 0.05)

### Exercise 10

Create a new marker matrix X from W and recode markers so that three genotypes $$AA$$, $$Aa$$, and $$aa$$ are coded as 1, 0, and -1, respectively.

X <- W - 1

Recall that the expectation of genotype, $$E(X)$$, is given by $$2p-1$$, where $$p$$ is the frequency of reference allele. Verify that $$2p-1$$ is equal to the mean of each genotype obtained from the colMeans() function.

p2 <- (2 * colSums(X == 1) + 1 * colSums(X == 0))/(2 * nrow(X))
table(p == p2)
all.equal(2 * p - 1, colMeans(X), tol = 0.005)

### Exercise 12

Recall that the variance of genotype, $$Var(X)$$, remains the same and is given by $$2p(1-p)$$. Verify that $$2p(1-p)$$ is close to the variance of each genotype obtained from the var() function.

var1 <- 2 * p * (1 - p)
var2 <- var(X)
var2 <- apply(X, 2, var)
all.equal(var1, var2, tol = 0.05)

### Exercise 13

Verify that no matter how we code markers, centered marker codes, $$W - E(W)$$ and $$X - E(X)$$, remain the same.

### Exercise 14

We will recode the SNP genotypes so that now the major allele is treated as a reference allele. Store the new coding into the W2 variable.

W2 <- W
W2[W2 == 0] <- 3
W2[W2 == 2] <- 0
W2[W2 == 3] <- 2

Compute the allele freqeuncy of SNP markers using W2. Compare your result with the allele frequency obtained from W.

# Freq of A (W)
p <- colSums(W)/(2 * nrow(W))
# Freq of A (W2)
pW2 <- colSums(W2)/(2 * nrow(W2))

table(W2[, 16])