Recreating Fisher’s ANOVA experiment in Spreadsheets vs R
Learning objectives
- Breakdown steps required in standard data filter then aggregate tasks
The first randomized controlled experiment with more than one treatment!
The beginning of modern statistics:
- Studies in crop variation: IV The experimental determination of the value of top dressing with cereals by T. Eden, R. A. Fisher, 1927, The Journal of Agricultural Science
Warm-up tasks, recreating historical calculations
- Task1: Try recreating Fisher’s results with his straw data (in 0.5 of pounds) and grain data in 1/8 of pounds using Spreadsheets (i.e. Google Drive).
- Task2: Try recreating the first 3 differences for grain
Key programming skill: Working backwards from the specifications
Requirement:
- Reproduce the first 3 numbers behind Table III for grain.
Start with drawing a diagram going backwards from the desired final outcome.
Concepts required!
- Aggregating data
- filtered specific rows
- subsetting specific columns
- repeating similar calculations
- reading in data
Reading in the data
Place the downloaded data into your Download
folder
df <- read.csv("~/Downloads/fisher_1927_grain.csv",
na.strings="")
Here we’ve already used several concepts:
- Created a variable called
df
- Assigned the output from
read.csv()
to a variable calleddf
- Passed multiple inputs to the function
read.csv()
Side Note - common mistakes when reading in data
- Your working directory (often abbreviated as “wd”) is not where you think it is.
Be sure you know which folder R is operating from in your computer
getwd() list.files() # you can change the wd to the Desktop using the following setwd("~/Desktop")
- Misspelling file names, copy/paste can be your friend
df <- read.csv("~/Downloads/fisher_7291_grain.csv") list.files()
How could you stress test this piece of code so far?
df often stands for “data frame”
df <- read.csv("~/Downloads/fisher_1927_grain.csv",
na.strings="")
Notice what errors might be returned:
- double quotes to single quotes
- Not closing the parentheses
- Capitalization of the function name
- using
=
instead of<-
- Changing the variable name
- …
How to examine a data frame
dim(df)
head(df)
colnames(df)
rownames(df)
Why not examine all the data?
df
print(df)
Properties of a data frame
- Rectangular, i.e. each column has the same number of rows
- Different columns can have different “types” of data
- Column names are often meaningful
- Rows often represent different records
Subsetting a column using a single value
Assume we’re working with the timing
column
- Using indices
timing <- df[, 3]
- Using column names
timing <- df[, "timing"]
Entered nothing before the comma implies “all rows”, i.e. no filtering/subsetting.
Which method of subsetting is preferable?
Subsetting a row using a single value
Similarly, we can subset a row out of the data frame
- Using numeric indices
row <- df[1, ]
What happens when you use a negative index? (unique to R!)
- Using character values
rownames(df) row <- df["3", ]
Again, nothing after the comma implies “all columns”
Subsetting columns using multiple values (vectors)
We want to only work with the part under “blockX” in the data frame. To do this, we can subset using
- Column names (character vector)
block_cols <- c("block1", "block2", "block3", "block4", "block5", "block6", "block7", "block8") yield <- df[, block_cols]
- Indices of the columns (numerical vector)
block_cols <- 4:11 yield <- df[, block_cols]
- TRUE/FALSE vectors (boolean vector)
block_cols <- grepl("block", colnames(df)) yield <- df[, block_cols]
- Note that
block_cols
is a vector of 3 different types of data in the 3 examples above
Subsetting rows using multiple values (vectors)
- Using boolean vectors
fertilized <- df[, "top_dressing"] > 0 df_fert <- df[fertilized, ]
It is possible to use the same methods as when subsetting for columns but that is not recommended!
Properties of vectors
- One dimensional with finite elements, check the length using
length(VEC)
- Can only have one type of data, check the type using
class(VEC)
- Can be constructed using
c(1, 2, -5)
- A convenient short hand for a vector of consecutive integers is
START:END
- To insert values into a vector, you can
place_holder <- c() place_holder[1] <- pi place_holder[3] <- -pi
How to stress test vector creation?
-5:2
2:-5
c("1", 2, 3)
(1, 2, 3)
demo_vec <- 1:3
length(demo_vec)
For-loop - avoid repetition in “blockX”?
For-loops are an easy way to repeat steps efficiently
block_cols <- c()
for(i in 1:8){
block_cols[i] <- paste0('block', i)
}
block_cols
Exercise - getting the sum of each row using a for-loop?
To get the sum of each row, it is best to break up the steps:
- To get the sum of each row, we first need to sum of the first row. A for-loop should help us repeat the calculation for the other rows.
-
To get the sum of a collection of numbers, pass them into
sum()
For example:
sum(1:100)
-
- To get the sum of the first row, we first need to get the data corresponding to the first row then add them up.
- To get the first row data, we need to subset only the columns corresponding to “blockX” and the row index = 1.
Compare your results with your Spreadsheet results afterwards.
Subsetting vectors
Next we only want the values corresponding to the “early” treatments. To do this, we want to subset the sum of each row from above using the treatment information. The sum of each row should be in a vector if you followed the examples.
You can subset vectors similarly as data frames using other vectors
vec_demo <- 1:5
vec_demo[3]
vec_demo[vec_demo == 3]
vec_demo[vec_demo >= 3]
Overview of boolean operations
Code | Operation | Example |
---|---|---|
> |
greater | vec_demo <- 1:5 vec_demo > 2 |
== |
equal | vec_demo <- c("A", "B", "B") vec_demo == "A" |
>= |
greater or equal | vec_demo <- 1:5 vec >= 2 |
!= |
Not equal | vec_demo <- c("A", "B", "B") vec_demo != "A" |
& |
and | vec_demo <- 1:5 (vec_demo > 2) & (vec_demo >= 2) |
| |
or | vec_demo <- 1:5 (vec_demo > 2) | (vec_demo >= 2) |
Exercise - getting the different row sums
How can we add up the row sums corresponding to just the early
treatments?
Note, for the treatment to be early, there must be at least 1 application of fertilizer as well.
Try repeating this for the late
treatments as well