Wayne's Github Page

A place to learn about statistics

Recreating Fisher’s ANOVA experiment in Spreadsheets vs R

Learning objectives

The first randomized controlled experiment with more than one treatment!

The beginning of modern statistics:

Fisher's experiment

Warm-up tasks, recreating historical calculations

Key programming skill: Working backwards from the specifications

Requirement:

Start by drawing a diagram that works backwards from the desired final outcome.

Concepts required!

Reading in the data

Place the downloaded data into your Downloads folder

df <- read.csv("~/Downloads/fisher_1927_grain.csv",          
               na.strings="")

Here we’ve already used several concepts:

Side Note - common mistakes when reading in data

How could you stress test this piece of code so far?

df often stands for “data frame”

df <- read.csv("~/Downloads/fisher_1927_grain.csv",
               na.strings="")

Notice what errors might be returned:
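One way to stress test the reading step is to feed it deliberately bad input and look at what comes back. The sketch below is only an illustration: the file names and the toy contents are made up, and it uses temporary files so it does not touch the real data.

```r
# Hypothetical path that should not exist, used only to provoke a failure
bad_path <- file.path(tempdir(), "no_such_file.csv")

# Catch whichever condition R signals first instead of stopping the script
result <- tryCatch(
  read.csv(bad_path, na.strings = ""),
  warning = function(w) conditionMessage(w),
  error   = function(e) conditionMessage(e)
)
result  # a message about not being able to open the file

# A tiny made-up CSV with one empty cell, to see what na.strings = "" does
tmp <- tempfile(fileext = ".csv")
writeLines(c("treatment,block1", "early,10", ",12"), tmp)
toy <- read.csv(tmp, na.strings = "")
is.na(toy$treatment)  # the empty cell is read in as NA
```

Other things worth trying in the same spirit are a misspelled file name or a path that points to the wrong folder.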

How to examine a data frame

dim(df)
head(df)
colnames(df)
rownames(df)

Why not examine all the data?

df
print(df)

Properties of a data frame

Subsetting a column using a single value

Assume we’re working with the timing column

Entering nothing before the comma implies “all rows”, i.e. no filtering/subsetting.
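As a small sketch of single-column subsetting, the code below uses a made-up data frame called `toy`. Only the column name `timing` comes from the text above; the other columns and all values are assumptions for illustration.

```r
# Toy data frame; values and block columns are invented for illustration
toy <- data.frame(
  timing = c("early", "late", "early"),
  block1 = c(10, 12, 11),
  block2 = c(9, 13, 10)
)

by_name     <- toy[, "timing"]  # nothing before the comma: keep all rows
by_position <- toy[, 1]         # the same column, selected by position
by_dollar   <- toy$timing       # shorthand for a single named column

identical(by_name, by_position)
identical(by_name, by_dollar)
```

All three return the same vector here. Selecting by name keeps working if the column order changes, whereas a position silently picks whatever column happens to be there.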

Which method of subsetting is preferable?

Subsetting a row using a single value

Similarly, we can subset a row out of the data frame
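A minimal sketch of row subsetting, again on a made-up data frame (`toy` and its values are hypothetical): the row index goes before the comma, and leaving the part after the comma empty keeps all columns.

```r
# Toy data frame; values are invented for illustration
toy <- data.frame(
  timing = c("early", "late", "early"),
  block1 = c(10, 12, 11),
  block2 = c(9, 13, 10)
)

second_row <- toy[2, ]  # row 2, all columns
second_row
class(second_row)       # a single row is still a data frame
```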

Subsetting columns using multiple values (vectors)

We want to work only with the part under “blockX” in the data frame. To do this, we can subset using a vector of column names.
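A short sketch of selecting several columns at once, assuming the block columns are named `block1`, `block2`, and so on. The data frame `toy` has only three such columns and made-up values; the real data may differ.

```r
# Toy data frame; values are invented for illustration
toy <- data.frame(
  timing = c("early", "late", "early"),
  block1 = c(10, 12, 11),
  block2 = c(9, 13, 10),
  block3 = c(8, 14, 12)
)

block_cols <- c("block1", "block2", "block3")

blocks_by_name     <- toy[, block_cols]  # vector of column names
blocks_by_position <- toy[, 2:4]         # vector of column positions

colnames(blocks_by_name)
identical(blocks_by_name, blocks_by_position)
```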

Subsetting rows using multiple values (vectors)

It is possible to use the same methods as when subsetting columns, but that is not recommended!
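As a rough illustration of the options, the sketch below selects rows from a made-up data frame, first by position and then by a logical condition on a column. Both are shown only as examples; `toy` and its values are hypothetical.

```r
# Toy data frame; values are invented for illustration
toy <- data.frame(
  timing = c("early", "late", "early"),
  block1 = c(10, 12, 11),
  block2 = c(9, 13, 10)
)

rows_by_position <- toy[c(1, 3), ]               # depends on the row order
early_rows       <- toy[toy$timing == "early", ] # depends on the values

rows_by_position
early_rows
```

Hard-coded positions break as soon as the rows are reordered or new rows are added, while a condition describes which rows are wanted.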

Properties of vectors

How to stress test vector creation?

-5:2
2:-5
c("1", 2, 3)
(1, 2, 3)
demo_vec <- 1:3
length(demo_vec)
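The stress tests above can also be inspected programmatically. The sketch below is one possible way to do so; the variable names are made up, and `parse(text = ...)` is used only so that the deliberately invalid line can be checked without stopping the script.

```r
up   <- -5:2  # the colon operator counts upwards ...
down <- 2:-5  # ... and also downwards
length(up)
length(down)

mixed <- c("1", 2, 3)  # mixing text and numbers
class(mixed)           # everything is converted to character

# (1, 2, 3) without c() is not valid R, so parsing it fails
parse_result <- tryCatch(
  parse(text = "(1, 2, 3)"),
  error = function(e) "syntax error"
)
parse_result
```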

For-loop - avoid repetition in “blockX”?

For-loops are an easy way to repeat steps efficiently

# Start with an empty vector and fill it one name at a time
block_cols <- c()
for(i in 1:8){
  block_cols[i] <- paste0('block', i)
}
# Show the resulting column names
block_cols
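For comparison, here is a sketch showing that the loop and a single vectorised call to `paste0()` give the same result. The names `block_cols_loop` and `block_cols_vec` are made up for this example.

```r
# Loop version: build the names one at a time
block_cols_loop <- c()
for (i in 1:8) {
  block_cols_loop[i] <- paste0("block", i)
}

# Vectorised version: paste0() accepts the whole vector 1:8 at once
block_cols_vec <- paste0("block", 1:8)

identical(block_cols_loop, block_cols_vec)
```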

Exercise - getting the sum of each row using a for-loop?

To get the sum of each row, it is best to break up the steps:

Compare your results with your Spreadsheet results afterwards.
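One possible way to break up the steps is sketched below on a made-up data frame with three block columns. The names `toy`, `row_sums`, and `row_values` and all values are assumptions; the real data has more columns and rows.

```r
# Toy data frame; values are invented for illustration
toy <- data.frame(
  timing = c("early", "late", "early"),
  block1 = c(10, 12, 11),
  block2 = c(9, 13, 10),
  block3 = c(8, 14, 12)
)

# Step 1: the names of the columns to add up
block_cols <- paste0("block", 1:3)

# Step 2: an empty numeric vector with one slot per row
row_sums <- numeric(nrow(toy))

# Step 3: visit each row, pull out its block values, and store their sum
for (i in seq_len(nrow(toy))) {
  row_values  <- unlist(toy[i, block_cols])
  row_sums[i] <- sum(row_values)
}

row_sums
```

The built-in `rowSums(toy[, block_cols])` can serve as a cross-check on the loop.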

Subsetting vectors

Next we only want the values corresponding to the “early” treatments. To do this, we want to subset the sum of each row from above using the treatment information. The sum of each row should be in a vector if you followed the examples.

You can subset vectors much like data frames, using other vectors

vec_demo <- 1:5
vec_demo[3]
vec_demo[vec_demo == 3]
vec_demo[vec_demo >= 3]
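The condition inside the brackets does not have to come from the vector being subset. The sketch below uses two made-up vectors of equal length, `row_sums` and `timing`, to pick out values of one based on the other.

```r
# Hypothetical row sums and the treatment timing of each row
row_sums <- c(27, 39, 33)
timing   <- c("early", "late", "early")

is_early <- timing == "early"  # logical vector, one entry per row
is_early

early_sums <- row_sums[is_early]  # keep the positions where is_early is TRUE
early_sums
```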

Overview of boolean operations

Code  Operation         Example
>     greater           vec_demo <- 1:5; vec_demo > 2
==    equal             vec_demo <- c("A", "B", "B"); vec_demo == "A"
>=    greater or equal  vec_demo <- 1:5; vec_demo >= 2
!=    not equal         vec_demo <- c("A", "B", "B"); vec_demo != "A"
&     and               vec_demo <- 1:5; (vec_demo > 2) & (vec_demo >= 2)
|     or                vec_demo <- 1:5; (vec_demo > 2) | (vec_demo >= 2)

Exercise - getting the different row sums

How can we add up the row sums corresponding to just the early treatments? Note that for a treatment to count as early, there must also be at least one application of fertilizer.

Try repeating this for the late treatments as well
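A sketch of how two conditions might be combined for this exercise is shown below. The column names `timing`, `applications`, and `row_sum` and all values are assumptions made for illustration; the real data frame may name and code these differently.

```r
# Toy data frame; column names and values are invented for illustration
toy <- data.frame(
  timing       = c("early", "late", "early", "none"),
  applications = c(1, 2, 2, 0),
  row_sum      = c(27, 39, 33, 20)
)

# Both conditions must hold: early timing AND at least one application
keep <- (toy$timing == "early") & (toy$applications >= 1)
keep

early_total <- sum(toy$row_sum[keep])
early_total
```

The late treatments would follow the same pattern with the timing condition changed.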