This Workshop in Data Science for the Biological Sciences is aimed at students and practitioners in the biological sciences who are interested in developing coding and data science skills to solve concrete problems emerging in their biological field of study.
By the end of the workshops, the participants will have acquired technical and theoretical skills in data science.
They will have learned how to:
Furthermore, they will have gained knowledge about:
We might not be able to cover the full program in this workshop, but the extra material contained below can be used for further, self-directed learning.
Audience: This course is aimed at students and researchers interested in data science for biology. No prerequisites are needed, but you might benefit from coding experience and an introductory course in statistics.
Location: This training will take place at Purdue University Fort Wayne.
Dates and Time: Lecture 1: November 29th, 2022. Lecture 2: December 1st, 2022. Time: Noon to 1:15pm.
Requirements: Participants must have access to a computer with a Mac or Windows operating system (not a tablet, Chromebook, etc.). It would help if the R and RStudio software are already installed on those computers, but we will go through the installation process during the workshop.
EDI: All of these activities are inclusive and founded on positive values, such as equity. We will benefit from the participation of interested colleagues from the most diverse backgrounds.
Contacts: Kathleen Lois Foster (klfoster86@bsu.edu) and Alessandro Maria Selvitella (aselvite@pfw.edu).
Timeline:
Day 1 (75 minutes):
Day 2 (75 minutes):
In this section, we will learn about variables and functions, data types and structures, and how to set up your working environment in R.
Variables are like objects; they hold information. They can be very simple.
Values are assigned to variables using the <- or = operators:
x<-8
x=8
You can print the value of a variable by typing its name (mainly effective for simple variables):
x
## [1] 8
Functions do stuff. The basic form of a function call is: output <- myFunction(argument1, argument2, ...)
Let’s make a variable called “numbers” that is a vector of the numbers 1 to 4.
numbers = 1:4
We can perform some basic functions on this variable, such as…
sum the vector elements:
sum(numbers)
## [1] 10
get the length of the vector:
length(numbers)
## [1] 4
concatenate (i.e. add additional values to the vector):
c(numbers,5,6,7,8)
## [1] 1 2 3 4 5 6 7 8
calculate the mean of the vector:
mean(numbers)
## [1] 2.5
calculate the standard deviation of the vector:
sd(numbers)
## [1] 1.290994
You can assign the output of a function to another variable to use later. Here we calculate the standard error of the mean of the vector ‘numbers’ and store the result in the variable ‘sem’:
sem<-sd(numbers)/sqrt(length(numbers))
You can do mathematical operations with a combination of variables and functions using the normal mathematical operators and brackets. Here we calculate the 95% confidence interval of the mean of the vector ‘numbers’:
c(mean(numbers)-1.96*sem,mean(numbers)+1.96*sem) # 95% confidence interval of the mean
## [1] 1.234825 3.765175
You can access the help file of a function by typing help(“functionName”) (or the shortcut ?functionName) or by searching in the lower-right panel of RStudio.
help("t.test")
Information in R (and other programming languages) can be represented by different data types and data structures.
The previous examples were of numeric type.
Character type can be used to save information as a single character
char1<-'1'
or string of characters:
char2<-'hello'
We can concatenate these two variables into a single vector:
c(char1, char2)
## [1] "1" "hello"
If you try to concatenate a character and numeric type together, they will both become characters:
c(1, char2)
## [1] "1" "hello"
Logical types correspond to TRUE and FALSE
logi<-c(TRUE, FALSE)
Logical operators are used to evaluate true/false statements:
1<2
## [1] TRUE
2==1
## [1] FALSE
2!=1
## [1] TRUE
1>=1
## [1] TRUE
Factors are useful for grouping other variables (more on this later)
factor(c(0,0,0,1,1,1))
## [1] 0 0 0 1 1 1
## Levels: 0 1
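You can also attach descriptive labels to the factor levels. Below, ‘control’ and ‘treated’ are hypothetical labels for the 0/1 coding:
factor(c(0,0,0,1,1,1), labels=c("control","treated"))
## [1] control control control treated treated treated
## Levels: control treated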
The data structures you will use most often are vectors, matrices, and dataframes (more later).
Here are some examples of vectors:
vec1<-c(2, 4, 6)        # combine values into a vector: 2 4 6
vec2<-2:6               # integer sequence: 2 3 4 5 6
vec3<-seq(2, 3, by=0.5) # sequence from 2 to 3 in steps of 0.5: 2.0 2.5 3.0
vec4<-rep(1:2, each=3)  # repeat each element of 1:2 three times: 1 1 1 2 2 2
Here is an example of how to make a matrix using the ‘matrix’ function: matrix(argument1, argument2, argument3). Here we specify the contents of the matrix in the first argument (i.e. numbers 1 to 9), the number of rows of the matrix in the second argument (i.e. 3), and the number of columns in the third argument (i.e. 3):
mat1 <- matrix(1:9, nrow = 3, ncol = 3)
mat1
## [,1] [,2] [,3]
## [1,] 1 4 7
## [2,] 2 5 8
## [3,] 3 6 9
Setting up a working directory makes it easy to organize your files. We do this using the ‘setwd’ function:
setwd('/Users/klfoster/Desktop/Rworkshop')
Packages extend R with many additional tools. You need to install packages before you can load them. Once they’re installed, you can run the following lines to load them into your R session so that their functions are available for you to use. Note that, although you only have to install packages once, you must re-run these library() lines every time you restart your R session in order to use the functions they contain.
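If any of these packages are missing from your computer, install them once first (install.packages downloads them from CRAN, so an internet connection is required):
install.packages(c("ggplot2", "car", "plyr", "MASS", "readxl"))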
library(ggplot2)
library(car)
## Loading required package: carData
library(plyr)
library(MASS)
library(readxl)
Remember, if you don’t understand how to use a function in a package, use the ‘help’ function:
help("ggplot")
We are going to use a piece of a biomechanical dataset on Anolis lizards obtained from Foster and Selvitella (2022). The complete reference for this resource is here:
Foster, K.L. and Selvitella, A.M. (2022). Transfer of Anolis locomotor behavior across environments and species. Integrative and Comparative Biology 62(3), 774–790. (https://doi.org/10.1093/icb/icac015).
Use the following line to load a .csv file containing your data into R:
anoleData <- data.frame(read.table(file='AnolisBiomechanicsData.csv', sep=',', header=TRUE))
Use the following line to load a .xlsx file containing your data into R:
anoleData <- data.frame(read_xlsx('AnolisBiomechanicsData.xlsx',sheet='Sheet1'))
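As an aside, base R’s read.csv is a shortcut for the read.table call shown above (it sets sep=',' and header=TRUE for you and already returns a data frame), so the following one-liner is equivalent for the .csv file:
anoleData <- read.csv('AnolisBiomechanicsData.csv')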
You can refer to specific values in a dataset using [row,column] indexing:
anoleData[1,6]
## [1] 0.67748
or data ranges [rows,columns]:
anoleData[1:3,1:7]
## Individual Body.Mass..kg. Animal.Length..m. Perch.Diameter Incline
## 1 1 0.00707 0.066 0 0
## 2 1 0.00707 0.066 0 0
## 3 1 0.00707 0.066 0 0
## Stride.velocity..m.s. Duty.Factor
## 1 0.67748 0.65000
## 2 0.54028 0.63014
## 3 0.51291 0.50602
or (for data frames) specific variables, using their column names:
anoleData$Stride.velocity..m.s.
## [1] 0.67748 0.54028 0.51291 0.56620 0.46852 0.37578 0.55394 0.41143 0.50626
## [10] 0.49874 0.67343 0.54037 0.39190 0.53999 0.12005 0.34676 0.31794 0.38834
## [19] 0.26505 0.21512 0.29521 0.65999 0.66576 0.88176 0.88953 0.71083 0.49636
## [28] 0.51963 0.47109 0.46899 0.27213 0.28983 0.24671 0.20531 0.18354 0.42694
## [37] 0.38353 0.15410 0.12002 0.40394 0.33275 0.31984 0.24056 0.31599 0.44797
## [46] 0.35004 1.05590 0.50766 0.54511 0.51137 0.46294 0.59481 0.69159 0.53520
## [55] 0.48156 0.87164 0.64537 0.31436 0.33885 0.34533 0.34834 0.55906 0.51746
## [64] 0.40416 0.38108 0.35828 0.30050 1.01970 0.96530 0.55085 0.39050 0.76025
## [73] 0.72737 0.65521 0.64639 0.78450 0.55487 0.46475 0.50627 0.49071 0.39104
## [82] 0.40320 0.44433 0.33434 0.41625 0.33260 0.16777 0.13257 0.10931
You can pull out the data that you want to work with, for example by removing the first 5 columns and naming that subset ‘data’, as we do below (note that data is also the name of a base R function, so a more descriptive name is often safer in your own scripts):
data <- anoleData[,-c(1:5)]
You can calculate summary statistics by group (in this case the group IDs are found in the column ‘Perch.Diameter’) using the plyr package:
s=ddply(anoleData,.(factor(Perch.Diameter)),summarise,
        n=length(Stride.velocity..m.s.),              # sample size per group
        m=mean(Stride.velocity..m.s.,na.rm=T),        # group mean
        std=sd(Stride.velocity..m.s.,na.rm=T),        # group standard deviation
        se=sd(Stride.velocity..m.s.,na.rm=T)/sqrt(n)) # standard error of the mean
Show your summary statistics:
show(s)
## factor(Perch.Diameter) n m std se
## 1 0 40 0.5123663 0.2468515 0.03903065
## 2 1 49 0.4323845 0.1491768 0.02131096
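If you prefer not to load plyr, base R’s aggregate function can compute simple group summaries. A minimal sketch, computing only the group means:
aggregate(Stride.velocity..m.s. ~ Perch.Diameter, data=anoleData, FUN=mean) # mean stride velocity per perch diameter group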
If we consider statistical models with one single input variable \(x\) and one single outcome variable \(y\), other than the model in which \(y\) is independent of \(x\) (\(y = c + \epsilon\), with \(c\) constant independent of \(x\)), the simplest model is the model in which \(y\) is connected to \(x\) through a linear function: \[y=\beta_0+\beta_1 x + \epsilon.\]
This model is called the simple linear regression model and assumes (i) a linear functional relationship between \(x\) and \(y\), (ii) that the \(x\)’s are non-random real-valued observations, (iii) that the errors \(\epsilon_1, \dots, \epsilon_n\) are independent and identically distributed, with \(E[\epsilon_i]=0\) and \(Var(\epsilon_i)=\sigma^2\), and (iv) that the errors are normally distributed (bell curve). Note that these assumptions might not be realistic in practice (e.g. the \(x\)’s are most likely random).
The hypothesis tests that we will see below are based on these assumptions. In particular, the test statistics and the thresholds for the rejection regions take the forms they do because of mathematical theorems that rely on these assumptions. We will not go into the details during the workshop, but you are very welcome to ask for further details anytime. The formulas you will see are not magical; they emerge from rigorous mathematical deductions based on assumptions (i) to (iv).
A linear regression model can be used to test whether an independent variable \(x\) (here ‘Stride.Frequency..Hz.’) has an effect on a dependent variable \(y\) (here ‘Stride.velocity..m.s.’). Note that both the input and output variables are continuous.
First, it can be helpful to quickly visualize your data:
plot(anoleData$Stride.Frequency..Hz.,anoleData$Stride.velocity..m.s.)
But be careful not to base your hypothesis on the data you have already seen; otherwise you inflate the apparent significance of your results. Build your hypothesis before doing your experiment!
Now, run the linear regression using the lm function. Note that many functions have a “data=” parameter. If you put the variable name of your dataset here, then you don’t have to use the $ for the column names that correspond to the input and output variables of your model.
model1<-lm(Stride.velocity..m.s.~Stride.Frequency..Hz., data=anoleData)
summary(model1) # Print out results of the linear regression
##
## Call:
## lm(formula = Stride.velocity..m.s. ~ Stride.Frequency..Hz., data = anoleData)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.22051 -0.08489 -0.01674 0.05756 0.31562
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.010066 0.040408 -0.249 0.804
## Stride.Frequency..Hz. 0.063402 0.005076 12.490 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1215 on 87 degrees of freedom
## Multiple R-squared: 0.642, Adjusted R-squared: 0.6379
## F-statistic: 156 on 1 and 87 DF, p-value: < 2.2e-16
The first line of the code assigns to the variable model1 the output of the linear regression function lm, which includes the estimated coefficients, their basic statistics, and much other important information about the fitted model. The second line prints a summary of the main features of the model.
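As a quick, informal check of assumptions (i) to (iv), R produces standard diagnostic plots for lm objects, including residuals vs. fitted values and a normal Q-Q plot of the residuals:
par(mfrow=c(2,2)) # arrange the four diagnostic plots in a 2x2 grid
plot(model1)      # residuals vs fitted, normal Q-Q, scale-location, residuals vs leverage
par(mfrow=c(1,1)) # restore the default single-plot layout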
The correlation coefficient is a scaled version of \(\beta_1\) that measures the extent of the linear relationship between two variables. It can be computed from the information in the summary: take the sign of the coefficient \(\beta_1\) and multiply it by the square root of the \(R^2\): \[r=sign(\beta_1)\sqrt{R^2}.\]
Using this measure, we note that there is a significant positive correlation between stride frequency and stride velocity.
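For example, using model1 from above (cor computes the same quantity directly, provided there are no missing values in these columns; otherwise add use="complete.obs"):
sign(coef(model1)[2])*sqrt(summary(model1)$r.squared) # sign of beta_1 times sqrt(R^2)
cor(anoleData$Stride.Frequency..Hz., anoleData$Stride.velocity..m.s.) # Pearson correlation, computed directly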
For all of our graphs, we will be using the ‘ggplot2’ package, as this package gives us a great deal of control over the aesthetic features of the graphs. For each of the graphs, you will notice that we specify the variable name of the dataset (here anoleData), provide information about the independent and dependent variables that we would like to display, and our chosen aesthetic features.
Create a basic scatterplot:
p=ggplot(anoleData,aes(x=Stride.Frequency..Hz., y=Stride.velocity..m.s.))+
geom_point(color="blue")+
theme(legend.position="none")
show(p)
If we want to add a regression line, we first need to run the statistics to get the parameters for the line we need to draw. This is done in the same way as we saw above in Part 2:
model1<-lm(Stride.velocity..m.s.~Stride.Frequency..Hz., data=anoleData)
summary(model1) # Print out results of the linear regression
##
## Call:
## lm(formula = Stride.velocity..m.s. ~ Stride.Frequency..Hz., data = anoleData)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.22051 -0.08489 -0.01674 0.05756 0.31562
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.010066 0.040408 -0.249 0.804
## Stride.Frequency..Hz. 0.063402 0.005076 12.490 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1215 on 87 degrees of freedom
## Multiple R-squared: 0.642, Adjusted R-squared: 0.6379
## F-statistic: 156 on 1 and 87 DF, p-value: < 2.2e-16
Then, we extract the coefficients from our model and place them in the variable coefs1 to use later:
coefs1=unname(model1$coefficients)
Now, we add the line to our plot:
p=p+geom_abline(intercept=coefs1[1],slope=coefs1[2],colour='red')
show(p)
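As an aside, ggplot2 can fit and draw the same least-squares line in one step with geom_smooth, without extracting the coefficients manually. A sketch using a separate plot object p2 (an illustrative name) so that p is left untouched:
p2=ggplot(anoleData,aes(x=Stride.Frequency..Hz., y=Stride.velocity..m.s.))+
  geom_point(color="blue")+
  geom_smooth(method="lm", se=FALSE, colour="red") # se=TRUE would add a confidence band
show(p2)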
Finally, let’s modify other aesthetic properties of our graph, such as colors and labels for our categories, axes labels, font size, etc.
q=p+labs(x = "Stride Frequency (Hz)", y = "Stride Velocity (m/s)") + #Label your axes
theme_bw()+
theme(panel.grid.major = element_blank(), #Take away the grid in the plot
panel.grid.minor = element_blank(),
legend.text=element_text(size=11), #Change the size of the legend text (not the title)
axis.title.y=element_text(vjust=2), #Change the position of the y axis title
text=element_text(size=16, family="serif"), #Make the rest of the font bigger and Times New Roman
plot.margin=unit(c(1,1,0.5,1.1),"cm")) #Change some margins so that things fit nicely.
show(q)
To save out a .png file of your beautiful graph simply specify the filename, background color (bg), and the desired width and height of your graph. It will save into the folder that you specified as your working directory at the beginning of Step One.
ggsave(q, filename = "freqVSvel.png", bg = "transparent",width=8,height=7) #Save out your BEAUTIFUL plot.
A multiple linear regression is a simple modification of the linear regression just discussed, in which you include more than one continuous input variable in your model. Instead of a single \(x\), you include several input variables \(x_1, \dots, x_p\), and instead of a line, the fitted model is a plane in a higher-dimensional space (no longer 2D!).
To do this, we use the same code as before but separate the input variables with a + symbol:
model2<-lm(Stride.velocity..m.s.~Stride.Frequency..Hz.+Duty.Factor+Femur.retraction.angle..deg.,data=anoleData)
summary(model2)
##
## Call:
## lm(formula = Stride.velocity..m.s. ~ Stride.Frequency..Hz. +
## Duty.Factor + Femur.retraction.angle..deg., data = anoleData)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.25133 -0.05839 -0.02124 0.07474 0.33427
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.3728423 0.1255382 2.970 0.00387 **
## Stride.Frequency..Hz. 0.0575953 0.0051824 11.114 < 2e-16 ***
## Duty.Factor -0.4338074 0.1566616 -2.769 0.00690 **
## Femur.retraction.angle..deg. -0.0017316 0.0005245 -3.302 0.00141 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1129 on 85 degrees of freedom
## Multiple R-squared: 0.6978, Adjusted R-squared: 0.6872
## F-statistic: 65.44 on 3 and 85 DF, p-value: < 2.2e-16
Q: which variables \(x\) show the highest association with \(y\)?
A t-test tests whether there is a difference in the means of an output variable \(y\) (here Stride.velocity..m.s.) split into two groups (here \(0\) or \(1\)) defined by an input variable \(x\) (here Perch.Diameter). Thus, \(y\) contains continuous data, but the grouping variable \(x\) is categorical.
The test gets its name because the test statistic used to decide in favor of or against the null hypothesis follows a \(Student\)-\(t\) (or simply \(t\)) distribution. Note that the \(t\)-distribution depends on a parameter called the degrees of freedom, so to determine which distribution must be used, you need to know the degrees of freedom as well.
This can be a source of confusion: many different tests are called t-tests because they are all based on the \(Student\)-\(t\) distribution. Depending on the context, the quantity that you have to calculate and the degrees of freedom may vary. Be careful about which \(t\)-test you are using!
First, it can be useful to quickly visualize the data that you want to analyze:
plot(anoleData$Perch.Diameter, anoleData$Stride.velocity..m.s.)
Separate your \(y\) (i.e. Stride.velocity..m.s.) according to \(x\) (i.e. Perch.Diameter):
Perch0<-anoleData$Stride.velocity..m.s.[anoleData$Perch.Diameter==0] #pull out the Stride.velocity..m.s. values that correspond to Perch.Diameter=0 group
Perch1<-anoleData$Stride.velocity..m.s.[anoleData$Perch.Diameter==1] #pull out the Stride.velocity..m.s. values that correspond to Perch.Diameter=1 group
To run a t-test, we will use the t.test function with the following general call: newModel<-t.test(y-values for group 1, y-values for group 2, var.equal=TRUE/FALSE, paired=FALSE/TRUE).
Again, there are many t-tests, depending on the assumptions of your model, whether you are performing a one-sided or two-sided test, and whether your data are paired. These options can be specified in the arguments when you call the t.test function.
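As an aside, t.test also accepts a formula interface that avoids splitting the data manually; the call below is equivalent to the two-sided, equal-variance test stored in ttest1 below:
t.test(Stride.velocity..m.s.~Perch.Diameter, data=anoleData, var.equal=TRUE, alternative="two.sided")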
Two-sided t-test (hypothesis: Perch0 is different from Perch1), assuming equal variance between Perch0 and Perch1 groups:
ttest1<-t.test(Perch0,Perch1, var.equal=T, alternative="two.sided")
ttest1 #print out the results of this t-test
##
## Two Sample t-test
##
## data: Perch0 and Perch1
## t = 1.8863, df = 87, p-value = 0.06259
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -0.004295783 0.164259303
## sample estimates:
## mean of x mean of y
## 0.5123663 0.4323845
One-sided t-test (hypothesis: Perch0 less than Perch1), assuming equal variance between Perch0 and Perch1 groups:
ttest2<-t.test(Perch0,Perch1, var.equal=T, alternative="less")
ttest2
##
## Two Sample t-test
##
## data: Perch0 and Perch1
## t = 1.8863, df = 87, p-value = 0.9687
## alternative hypothesis: true difference in means is less than 0
## 95 percent confidence interval:
## -Inf 0.1504767
## sample estimates:
## mean of x mean of y
## 0.5123663 0.4323845
One-sided t-test (hypothesis: Perch0 greater than Perch1), assuming equal variance between Perch0 and Perch1 groups:
ttest3<-t.test(Perch0,Perch1, var.equal=T, alternative="greater")
ttest3
##
## Two Sample t-test
##
## data: Perch0 and Perch1
## t = 1.8863, df = 87, p-value = 0.0313
## alternative hypothesis: true difference in means is greater than 0
## 95 percent confidence interval:
## 0.009486859 Inf
## sample estimates:
## mean of x mean of y
## 0.5123663 0.4323845
Two-sided t-test, approximation (Welch) for non-equal variances between Perch0 and Perch1 groups:
ttest4<-t.test(Perch0,Perch1, var.equal=F)
ttest4
##
## Welch Two Sample t-test
##
## data: Perch0 and Perch1
## t = 1.7986, df = 61.294, p-value = 0.07701
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -0.008932166 0.168895687
## sample estimates:
## mean of x mean of y
## 0.5123663 0.4323845
Paired t-test (to be used when the same subjects have data represented in both categories of the grouping variable; here, purely to illustrate the syntax, we truncate Perch1 to its first 40 values so that the two groups have equal length):
ttest5<-t.test(Perch0,Perch1[1:40], paired=T)
ttest5
##
## Paired t-test
##
## data: Perch0 and Perch1[1:40]
## t = 1.4386, df = 39, p-value = 0.1582
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -0.02638383 0.15635383
## sample estimates:
## mean of the differences
## 0.064985
An Analysis of Variance (ANOVA) tests whether there is a difference in the means of an output variable \(y\) (here ‘Stride.velocity..m.s.’) split according to more than \(2\) groups (here ‘0’, ‘45’, or ‘90’) within an input variable \(x\) (here ‘Incline’). There are two different ways of running an ANOVA in R. They are equivalent.
The first way is to use the lm function from the stats package that is included in R by default.
model3<-lm(Stride.velocity..m.s.~Incline, data = anoleData)
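One caution, as an aside (the output below reflects the original coding): because ‘Incline’ is stored as a number (0, 45, or 90), lm fits a linear trend in incline with a single degree of freedom, rather than separately comparing the three group means. To treat ‘Incline’ as a categorical grouping variable, wrap it in factor(); model3f below is just an illustrative name:
model3f<-lm(Stride.velocity..m.s.~factor(Incline), data = anoleData) # three-group ANOVA with 2 degrees of freedom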
After running the ANOVA and storing the results in the variable model3, we can print out an ANOVA table using different types of sums of squares. Unless you have a reason to do otherwise, the standard practice within biology is to use the Type III sums of squares.
Type I sums of squares (with a single predictor, the F-test reported by summary() matches the sequential Type I table produced by anova()):
summary(model3)
##
## Call:
## lm(formula = Stride.velocity..m.s. ~ Incline, data = anoleData)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.27770 -0.11290 0.00351 0.07756 0.43693
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.6189660 0.0252682 24.496 < 2e-16 ***
## Incline -0.0035050 0.0004461 -7.857 9.67e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1553 on 87 degrees of freedom
## Multiple R-squared: 0.415, Adjusted R-squared: 0.4083
## F-statistic: 61.73 on 1 and 87 DF, p-value: 9.67e-12
Type II sums of squares:
Anova(model3)
## Anova Table (Type II tests)
##
## Response: Stride.velocity..m.s.
## Sum Sq Df F value Pr(>F)
## Incline 1.4881 1 61.727 9.67e-12 ***
## Residuals 2.0974 87
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Type III sums of squares:
Anova(model3,type="III")
## Anova Table (Type III tests)
##
## Response: Stride.velocity..m.s.
## Sum Sq Df F value Pr(>F)
## (Intercept) 14.4661 1 600.048 < 2.2e-16 ***
## Incline 1.4881 1 61.727 9.67e-12 ***
## Residuals 2.0974 87
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The second way to perform an ANOVA is to use the aov function, also found within the stats package:
model4<-aov(Stride.velocity..m.s. ~ Incline, data = anoleData)
Again, here we can print out an ANOVA table with Type III sums of squares for this model:
Anova(model4,type="III")
## Anova Table (Type III tests)
##
## Response: Stride.velocity..m.s.
## Sum Sq Df F value Pr(>F)
## (Intercept) 14.4661 1 600.048 < 2.2e-16 ***
## Incline 1.4881 1 61.727 9.67e-12 ***
## Residuals 2.0974 87
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Importantly, an ANOVA only tells us that there is a difference somewhere; it does not tell us where the differences occur. Usually, we want to know whether only one group is different from all the others or whether there are differences between several pairs of groups. To determine this, we perform post-hoc pairwise t-tests, being sure to correct for the higher false discovery rate that comes from performing multiple tests (here using the Benjamini-Hochberg ‘BH’ correction):
pairwise.t.test(anoleData$Stride.velocity..m.s., anoleData$Incline, paired=FALSE, p.adjust.method="BH")
##
## Pairwise comparisons using t tests with pooled SD
##
## data: anoleData$Stride.velocity..m.s. and anoleData$Incline
##
## 0 45
## 45 6.4e-05 -
## 90 4.1e-11 9e-04
##
## P value adjustment method: BH
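Another common post-hoc option is Tukey’s Honest Significant Difference test, which operates on aov fits. A sketch, assuming ‘Incline’ is converted to a factor as discussed above:
TukeyHSD(aov(Stride.velocity..m.s.~factor(Incline), data=anoleData))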
A two-way ANOVA is a simple modification of the one-way ANOVA in which there are two input variables \(x_1, x_2\) rather than one. By including both input variables in your model at the same time, you can determine the effect of each one alone on the output variable \(y\) of interest, as well as whether \(x_1\) and \(x_2\) interact, if you choose to include an interaction term.
The * between the predictor variables Perch.Diameter and Incline in the linear model Stride.velocity..m.s.~Perch.Diameter*Incline indicates that you would like to test for the interaction between these two input variables:
model5<-aov(Stride.velocity..m.s. ~ Perch.Diameter*Incline, data = anoleData)
Anova(model5,type="III")
## Anova Table (Type III tests)
##
## Response: Stride.velocity..m.s.
## Sum Sq Df F value Pr(>F)
## (Intercept) 7.7429 1 364.766 < 2.2e-16 ***
## Perch.Diameter 0.2573 1 12.120 0.0007901 ***
## Incline 1.0310 1 48.569 6.384e-10 ***
## Perch.Diameter:Incline 0.0715 1 3.368 0.0699729 .
## Residuals 1.8043 85
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
If there is no significant interaction between \(x_1, x_2\), it is common practice to re-run the model without the interaction term by replacing the * symbol with a + symbol:
model5<-aov(Stride.velocity..m.s. ~ Perch.Diameter + Incline, data = anoleData)
Anova(model5, type="III")
## Anova Table (Type III tests)
##
## Response: Stride.velocity..m.s.
## Sum Sq Df F value Pr(>F)
## (Intercept) 10.8007 1 495.183 < 2.2e-16 ***
## Perch.Diameter 0.2216 1 10.161 0.002001 **
## Incline 1.5689 1 71.928 5.599e-13 ***
## Residuals 1.8758 86
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
In contrast with the two-way ANOVA, an Analysis of Covariance (ANCOVA) involves independent variables that are both continuous (here Stride.Frequency..Hz.) and categorical (here Perch.Diameter). As with the two-way ANOVA, we can choose to test for interaction between these input variables or not, by specifying a * symbol or a + symbol in our models.
First, let’s run the ANCOVA with the interaction term:
model6=lm(Stride.velocity..m.s.~Stride.Frequency..Hz.*Perch.Diameter,data=anoleData)
summary(model6)
##
## Call:
## lm(formula = Stride.velocity..m.s. ~ Stride.Frequency..Hz. *
## Perch.Diameter, data = anoleData)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.35857 -0.03892 -0.00019 0.04787 0.21760
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.123469 0.042425 -2.910 0.00461 **
## Stride.Frequency..Hz. 0.090370 0.005701 15.853 < 2e-16 ***
## Perch.Diameter 0.139124 0.058980 2.359 0.02063 *
## Stride.Frequency..Hz.:Perch.Diameter -0.038027 0.007518 -5.058 2.41e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.08746 on 85 degrees of freedom
## Multiple R-squared: 0.8187, Adjusted R-squared: 0.8123
## F-statistic: 127.9 on 3 and 85 DF, p-value: < 2.2e-16
Even though there was a significant interaction, let’s try running the ANCOVA without the interaction term:
model7=lm(Stride.velocity..m.s.~Stride.Frequency..Hz.+Perch.Diameter,data=anoleData)
summary(model7)
##
## Call:
## lm(formula = Stride.velocity..m.s. ~ Stride.Frequency..Hz. +
## Perch.Diameter, data = anoleData)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.298048 -0.064420 0.007726 0.066679 0.229615
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.030366 0.033542 0.905 0.368
## Stride.Frequency..Hz. 0.068506 0.004214 16.256 < 2e-16 ***
## Perch.Diameter -0.143387 0.021490 -6.672 2.33e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.09917 on 86 degrees of freedom
## Multiple R-squared: 0.7641, Adjusted R-squared: 0.7586
## F-statistic: 139.3 on 2 and 86 DF, p-value: < 2.2e-16
Q: What difference did you notice?
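A formal way to compare the two nested ANCOVA models is an F-test using the anova function, which tests whether adding the interaction term significantly improves the fit:
anova(model7, model6) # compare the no-interaction model against the model with the interaction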
The data associated with a t-test are often visualized with a bar graph or a box plot. In a bar graph, the heights of the bars represent the means of the groups being compared, and the error bars around those means are usually the standard errors of those means. There are a few ways to create a box plot, but they commonly show the median, the 1st and 3rd quartiles (the 25th and 75th percentiles), and the minimum and maximum data points of each group being compared.
Below is an alternative way of visualizing the data: plotting the raw data points along with the mean and standard error of the mean of each group, represented by horizontal bars and error bars. The advantage of this type of plot is that the reader can see all the raw data alongside the descriptive statistics (e.g. mean and standard error) most frequently reported in studies in the biological sciences.
First, create a basic plot with just the data points for ‘Stride.velocity..m.s.’:
p=ggplot(anoleData,aes(x=factor(Perch.Diameter),y=Stride.velocity..m.s.))+
geom_point(aes(color=factor(Perch.Diameter)))
show(p)
Now, add the means for the groups as horizontal bars and vertical error bars. Below is the code for adding error bars that represent standard errors of the means (fun.data=mean_se), but this can easily be changed to plot \(95\%\) confidence intervals instead (fun.data=mean_cl_normal):
p=p+stat_summary(geom = "errorbar",
fun = mean,
aes(ymin = ..y.., ymax = ..y..),
width=0.75,size=1)+
stat_summary(geom = "errorbar",
fun.data=mean_se,
width=0.2)
show(p)
Finally, let’s modify other aesthetic properties of our graph as we did with the scatterplot above:
q=p+scale_color_manual(values=c('red','blue'),
labels=c('Broad','Narrow'),
name=c('Perch Diameter'))+ #Change the color and labels for your categories.
labs(x = "Perch Diameter", y = "Stride Velocity (m/s)") + #Label your axes
theme_bw()+
theme(panel.grid.major = element_blank(), #Take away the grid in the plot
panel.grid.minor = element_blank(),
legend.text=element_text(size=11), #Change the size of the legend text (not the title)
axis.title.y=element_text(vjust=5), #Change the position of the y axis title
axis.text.x=element_blank(), #Get rid of your x axis labels
text=element_text(size=16, family="serif"), #Make the rest of the font bigger and Times New Roman
plot.margin=unit(c(1,1,0.5,1.1),"cm")) #Change some margins so that things fit nicely.
show(q)
Below is the code for creating a plot that would accompany a one-way ANOVA (here comparing velocities on different inclines), in the same style as the one we just created for a t-test. As before, the raw data points for ‘Stride.velocity..m.s.’ are plotted with mean and standard error bars for each of the categories (0, 45, and 90 deg inclines) from the ‘Incline’ variable.
ggplot(anoleData,aes(x=factor(Incline),y=Stride.velocity..m.s.))+
geom_point(aes(color=factor(Incline)))+
stat_summary(geom = "errorbar",
fun = mean,
aes(ymin = ..y.., ymax = ..y..),
width=0.75,size=1)+
stat_summary(geom = "errorbar",
fun.data=mean_se,
width=0.2)+
scale_color_manual(values=c('red','blue','green'),
labels=c('0 deg','45 deg','90 deg'),
name=c('Incline'))+ #Change the color and labels for your categories.
labs(x = "Incline", y = "Stride Velocity (m/s)") + #Label your axes
theme_bw()+ #Change the "theme" to a prettier one.
theme(panel.grid.major = element_blank(), #Take away the grid in the plot
panel.grid.minor = element_blank(),
legend.text=element_text(size=12), #Change the size of the legend text (not the title)
axis.title.y=element_text(vjust=5), #Change the position of the y axis title
axis.title.x=element_blank(), #Get rid of your x axis title
text=element_text(size=16, family="serif"), #Make the rest of the font bigger and Times New Roman
plot.margin=unit(c(1,1,0.5,1.1),"cm")) #Change some margins so that things fit nicely.
Below is the code for creating a plot that would accompany a two-way ANOVA (here comparing velocities on different inclines and perch diameters), in the same style as the one we just created for the one-way ANOVA. As before, the raw data points for ‘Stride.velocity..m.s.’ are plotted with mean and standard error bars for each of the categories (0, 45, and 90 deg inclines and broad vs narrow perch diameters) from the ‘Incline’ and ‘Perch.Diameter’ variables.
ggplot(anoleData,aes(x=factor(Incline),y=Stride.velocity..m.s.,color=factor(Perch.Diameter)))+
geom_point(aes(color=factor(Perch.Diameter)),position=position_dodge(0.9))+
stat_summary(geom = "errorbar",
fun = mean,
aes(ymin = ..y.., ymax = ..y..),
width=0.75,size=1,
position=position_dodge(0.9))+
stat_summary(geom = "errorbar",
fun.data=mean_se,
width=0.2,
position=position_dodge(.9))+
scale_color_manual(values=c('red','blue'),
labels=c('Broad','Narrow'),
name=c('Perch Diameter'))+ #Change the color and labels for your categories.
scale_x_discrete(name='Incline',
breaks=c('0','45','90'), #Change the labels on your x axis.
labels=c('0 deg','45 deg','90 deg'))+
labs(x = "Incline", y = "Stride Velocity (m/s)") + #Label your axes
theme_bw()+ #Change the "theme" to a prettier one.
theme(panel.grid.major = element_blank(), #Take away the grid in the plot
panel.grid.minor = element_blank(),
legend.text=element_text(size=12), #Change the size of the legend text (not the title)
axis.title.y=element_text(vjust=5), #Change the position of the y axis title
text=element_text(size=16, family="serif"), #Make the rest of the font bigger and Times New Roman
plot.margin=unit(c(1,1,0.5,1.1),"cm")) #Change some margins so that things fit nicely.
In contrast with the plot to accompany a two-way ANOVA, the plot for an ANCOVA involves input variables that are both continuous and categorical. Below we go through the steps to make a graph that plots the continuous input variable against the continuous output variable, with different colors representing the groups of the categorical input variable.
First, we need to perform the ANCOVA so that the statistical output can be used for the graph. This is the same process as used above in Part 7:
mod1=lm(Stride.velocity..m.s.~Stride.Frequency..Hz.*factor(Perch.Diameter),data=anoleData) #ANCOVA with the interaction term
summary(mod1)
##
## Call:
## lm(formula = Stride.velocity..m.s. ~ Stride.Frequency..Hz. *
## factor(Perch.Diameter), data = anoleData)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.35857 -0.03892 -0.00019 0.04787 0.21760
##
## Coefficients:
## Estimate Std. Error t value
## (Intercept) -0.123469 0.042425 -2.910
## Stride.Frequency..Hz. 0.090370 0.005701 15.853
## factor(Perch.Diameter)1 0.139124 0.058980 2.359
## Stride.Frequency..Hz.:factor(Perch.Diameter)1 -0.038027 0.007518 -5.058
## Pr(>|t|)
## (Intercept) 0.00461 **
## Stride.Frequency..Hz. < 2e-16 ***
## factor(Perch.Diameter)1 0.02063 *
## Stride.Frequency..Hz.:factor(Perch.Diameter)1 2.41e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.08746 on 85 degrees of freedom
## Multiple R-squared: 0.8187, Adjusted R-squared: 0.8123
## F-statistic: 127.9 on 3 and 85 DF, p-value: < 2.2e-16
mod2=lm(Stride.velocity..m.s.~Stride.Frequency..Hz.+factor(Perch.Diameter),data=anoleData) #ANCOVA without the interaction term
summary(mod2)
##
## Call:
## lm(formula = Stride.velocity..m.s. ~ Stride.Frequency..Hz. +
## factor(Perch.Diameter), data = anoleData)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.298048 -0.064420 0.007726 0.066679 0.229615
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.030366 0.033542 0.905 0.368
## Stride.Frequency..Hz. 0.068506 0.004214 16.256 < 2e-16 ***
## factor(Perch.Diameter)1 -0.143387 0.021490 -6.672 2.33e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.09917 on 86 degrees of freedom
## Multiple R-squared: 0.7641, Adjusted R-squared: 0.7586
## F-statistic: 139.3 on 2 and 86 DF, p-value: < 2.2e-16
Then, we extract the coefficients from our models and place them in two variables (coefs1 and coefs2) to use later:
coefs1=unname(mod1$coefficients)
coefs2=unname(mod2$coefficients)
Make a graph using the results of the ANCOVA with the interaction term:
ggplot(anoleData, aes(x=Stride.Frequency..Hz., y=Stride.velocity..m.s., colour=factor(Perch.Diameter))) +
geom_point(size=3) +
geom_abline(intercept=coefs1[1],slope=coefs1[2],colour='red')+
geom_abline(intercept=coefs1[1]+coefs1[3],slope=coefs1[2]+coefs1[4],colour='blue')+
scale_color_manual(values=c('red','blue'),
labels=c('Broad','Narrow'),
name=c('Perch Diameter'))+ #Change the color and labels for your categories.
labs(x = "Stride Frequency (Hz)", y = "Stride Velocity (m/s)") + #Label your axes
theme_bw()+ #Change the "theme" to a prettier one.
theme(panel.grid.major = element_blank(), #Take away the grid in the plot
panel.grid.minor = element_blank(),
legend.text=element_text(size=12), #Change the size of the legend text (not the title)
text=element_text(size=16, family="serif"), #Make the rest of the font bigger and Times New Roman
plot.margin=unit(c(1,1,0.5,1.1),"cm")) #Change some margins so that things fit nicely.
Make a graph using the results of the ANCOVA without the interaction term (note that the two fitted lines now share the same slope):
ggplot(anoleData, aes(x=Stride.Frequency..Hz., y=Stride.velocity..m.s., colour=factor(Perch.Diameter))) +
geom_point(size=3) +
geom_abline(intercept=coefs2[1],slope=coefs2[2],colour='red')+
geom_abline(intercept=coefs2[1]+coefs2[3],slope=coefs2[2],colour='blue')+
scale_color_manual(values=c('red','blue'),
labels=c('Broad','Narrow'),
name=c('Perch Diameter'))+ #Change the color and labels for your categories.
labs(x = "Stride Frequency (Hz)", y = "Stride Velocity (m/s)") + #Label your axes
theme_bw()+ #Change the "theme" to a prettier one.
theme(panel.grid.major = element_blank(), #Take away the grid in the plot
panel.grid.minor = element_blank(),
legend.text=element_text(size=12), #Change the size of the legend text (not the title)
text=element_text(size=16, family="serif"), #Make the rest of the font bigger and Times New Roman
plot.margin=unit(c(1,1,0.5,1.1),"cm")) #Change some margins so that things fit nicely.
A chi-squared test of independence tests whether there is a significant relationship between two categorical variables.
First, we create a contingency table:
tbl<-table(anoleData$Incline,anoleData$Perch.Diameter)
tbl # Print the contingency table
##
## 0 1
## 0 12 20
## 45 15 14
## 90 13 15
Next, we perform a chi-squared test of independence:
model8<-chisq.test(tbl)
model8 # Print out results of the chi-squared test of independence
##
## Pearson's Chi-squared test
##
## data: tbl
## X-squared = 1.2803, df = 2, p-value = 0.5272
Q: Is there independence or not?
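You can also inspect the expected counts underlying the test; the chi-squared approximation is generally considered reliable when none of these are too small (a common rule of thumb is at least 5 per cell):
model8$expected # expected counts under the hypothesis of independence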
When the data you wish to analyse do not satisfy the assumptions required for the parametric tests described earlier, you can use non-parametric alternatives. These tests do not assume that the data follow a normal distribution, or indeed any particular functional form, so they remain valid when the shape of the underlying distribution is unknown. A couple of common non-parametric tests are described below.
To perform a Mann-Whitney-Wilcoxon test, a non-parametric version of the t-test, we use the wilcox.test function:
model9<-wilcox.test(Stride.velocity..m.s.~Perch.Diameter, data=anoleData)
model9
##
## Wilcoxon rank sum exact test
##
## data: Stride.velocity..m.s. by Perch.Diameter
## W = 1154, p-value = 0.1531
## alternative hypothesis: true location shift is not equal to 0
To perform a Kruskal-Wallis rank sum test, a non-parametric version of the one-way ANOVA, we use the kruskal.test function:
model10<-kruskal.test(Stride.velocity..m.s.~Incline, data=anoleData)
model10
##
## Kruskal-Wallis rank sum test
##
## data: Stride.velocity..m.s. by Incline
## Kruskal-Wallis chi-squared = 41.598, df = 2, p-value = 9.271e-10
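As with the ANOVA, a significant Kruskal-Wallis test only indicates that a difference exists somewhere; post-hoc pairwise Wilcoxon tests with a multiple-comparison correction can locate it:
pairwise.wilcox.test(anoleData$Stride.velocity..m.s., anoleData$Incline, p.adjust.method="BH")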
Foster, K.L. and Selvitella, A.M. (2022). Transfer of Anolis locomotor behavior across environments and species. Integrative and Comparative Biology 62(3), 774–790. (https://doi.org/10.1093/icb/icac015).
James, G., Witten, D., Hastie, T., and Tibshirani, R. (2021). An Introduction to Statistical Learning: With Applications in R. (2nd Ed). New York: Springer Nature.
McCune, B. and Grace, J.B. (2002). Analysis of Ecological Communities. Gleneden Beach: MjM Software Design.