Ordinary least squares and BLUE

Overview

We will learn how to connect phenotypes and systematic effects (fixed effects) using ordinary least squares (OLS).

Read file

Use the function read.table() to read the phenotype file (1000withEffects.redangus).

dat <- read.table(file = file.choose(), header = TRUE, stringsAsFactors = FALSE)
dim(dat)

Phenotypes

The hist() function plot a histogram.

hist(dat$BWT, col = "lightblue", main = "Histogram", xlab = "Birth weights")

The argument prob=TRUE returns a probability on the y axis. We use the lines() function to add a density estimate along with the histogram.

hist(dat$BWT, col = "lightblue", main = "Histogram", prob = TRUE, ylim = c(0, 
    0.04), xlab = "Birth weights")
lines(density(dat$BWT), col = "red")

Exercise 1

What are the mean, median, standard deviation, maximum, and minimum of birth weights?

Systematic effects

We will estimate the effects of age of dams and sex of calves on birth weights. Let’s first check the distributions of the systematic effects.

Age of dams

The hist() functions plot a histogram.

hist(dat$AgeDam.mon., col = "lightblue", main = "Histogram", xlab = "Age of dams")

Below is a histogram with the argumment prob=TRUE.

hist(dat$AgeDam.mon., col = "lightblue", main = "Histogram", prob = TRUE, xlab = "Age of dams")
lines(density(dat$AgeDam.mon), col = "red")

Exercise 2

What are the mean, median, standard deviation, maximum, and minimum of age of dams?

Sex of calves

The barplot() function generates a barplot.

barplot(table(dat$SEX), main = "Distribution", ylab = "Frequency", names.arg = c("Bulls", 
    "Cows", "Steers"), col = "lightblue")

Because there is only one steer, we treat him as a bull.

table(dat$SEX)
dat$SEX[dat$SEX == "S"] <- "B"
table(dat$SEX)

Birth weights and age of dams

Here we visualize the relationship between age of dams and birth weights.

plot(dat$AgeDam.mon., dat$BWT, xlab = "BWT", ylab = "Age of dams")
abline(lm(dat$BWT ~ dat$AgeDam.mon.), lwd = 3, col = "red")
lines(lowess(dat$AgeDam.mon., dat$BWT), lwd = 4, col = "blue")

Exercise 3

What is the correlation between age of dams and birth weights?

Birth weights and sex of calves

The boxplot() function generates distributions of birth weights for bulls and cows.

boxplot(dat$BWT[dat$SEX == "B"], dat$BWT[dat$SEX == "C"], names = c("Bulls", 
    "Cows"), ylab = "Birth weights")

Exercise 4

What is the mean difference between birth weights of bulls and cows?

Ordinary least squares

We will use the lm() function to fit OLS. This function offers an intuitive formula syntax. The summary() function summarizes a model fit.

fit <- lm(BWT ~ AgeDam.mon. + factor(SEX), data = dat)
summary(fit)

Save R objects

Use the save() function to save variables as R objects.

save(dat, file = "dat_day17.Rda")
save(fit, file = "fit_day17.Rda")

Exercise 5

Comfirm the estimates of p-values from t-statistics by using the pt function. The pt() function gives the probability density function.

Exercise 6

Explain why the degrees of freedom is 997.

Exercise 7

Comfirm the residual standard error by using the predict() function. This function returns \(\hat{\mathbf{y}}\). Note that the residual standard error is given by \[ \begin{align*} \hat{\sigma}^2_e &= \frac{\sum_{i=1}^{n} (y_i - \sum_{j=1}^{m} x_{ij}\hat{b}_{j})^2 }{n - k - 1} \\ &= \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2 }{n - k - 1}. \\ \end{align*} \]

Exercise 8

Compute the multiple R-squared (\(R^2\)) by yourself. The multiple R-squared indicates the proportion of explained variance. It is given by the squared correlation between the observed and predicted values. \[ R^2 = cor(\mathbf{y}, \hat{\mathbf{y}})^2 \]

Exercise 9

Compute the adjusted multiple R-squared by yourself. The adjusted multiple R-squared is given by \[ R^2_{adj} = 1 - (1 - R^2) \left(\frac{n-1}{n - k - 1} \right). \]

Gota Morota

March 7, 2017