Wayne's Github Page

A place to learn about statistics

Quickly summarizing data

Besides visualizing the data, we often want to summarize the data across different dimensions as well.

To demonstrate this, we will re-introduce the classic grain harvest data by Fisher. This dataset records a single year of harvest under different fertilizer treatments, which vary in the amount, timing, and type of fertilizer.

What’s unique about this experiment is that Fisher also grouped the treatment plots into blocks on the field, so that differences in soil fertility between parts of the field would be controlled for when comparing treatments.

Exploring the dataset

grain <- read.csv("fisher_1927_grain.csv")

dim(grain)

grain

What to notice?

Calculate summary statistics along different dimensions using apply()

To compare the different blocks, we could calculate the total yields for each block. First try doing this with a for-loop:

# Names of the eight block columns: "block1", ..., "block8"
block_columns <- paste0('block', 1:8)
# Pre-allocate one slot per block
block_totals <- rep(NA, length(block_columns))
for(i in seq_along(block_columns)){
    # Select the i-th block column by name and sum it
    block_totals[i] <- sum(grain[, block_columns[i]])
}
names(block_totals) <- block_columns
block_totals

We subset the data frame by column name to grab each block's column, then named the elements of the result vector after those columns.

It turns out we can do this in one line using apply():

block_totals <- apply(grain[, block_columns], 2, sum)
block_totals
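
As a side note, column totals are common enough that base R also ships a shortcut, colSums(). The sketch below compares the two on a small made-up data frame; `toy` and its values are hypothetical stand-ins, not Fisher's data.

```r
# Toy stand-in for the block columns (values are made up, not Fisher's data)
toy <- data.frame(block1 = c(1, 2, 3),
                  block2 = c(10, 20, 30))

# Column totals via apply(): MARGIN = 2 means "work down each column"
totals_apply <- apply(toy, 2, sum)

# colSums() is a built-in shortcut for the same calculation
totals_colsums <- colSums(toy)

totals_apply
totals_colsums
```

Both calls should return a named vector with one total per column. apply() stays useful because it accepts any function, not just sum.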

What to notice?

Elements of apply()

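If you look at the documentation for apply(), you’ll see something like apply(X, MARGIN, FUN, ...), where X is the matrix or data frame, MARGIN says whether to work on rows (1) or columns (2), and FUN is the function to apply.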

Below are 2 different visualizations for how apply() works:

Applying a function across the data in the rows
[Figure: apply on the rows]

Applying a function across the data in the columns
[Figure: apply on the columns]
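
To make the two directions concrete, here is a minimal sketch on a small made-up matrix. The object `m` and its labels are hypothetical. It also shows that extra arguments after FUN (here na.rm = TRUE) are passed along to the function through the `...` slot.

```r
# Small made-up matrix: 2 rows (treatments) by 3 columns (blocks)
m <- matrix(c(1, 2, 3,
              4, NA, 6),
            nrow = 2, byrow = TRUE,
            dimnames = list(c("trt_a", "trt_b"),
                            c("block1", "block2", "block3")))

# MARGIN = 1: the function receives one row at a time
row_totals <- apply(m, 1, sum, na.rm = TRUE)

# MARGIN = 2: the function receives one column at a time
col_totals <- apply(m, 2, sum, na.rm = TRUE)

row_totals
col_totals
```

The row version returns one value per row name and the column version one value per column name, so the length of the output tells you which margin was used.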

Exercise

Apply a function on records that satisfy a condition

For the Fisher dataset, we also want the total yield from all the plots that received “sulphate” and from all the plots that received “muriate”. The difference between these totals is the starting point for estimating the average treatment effect.

To accomplish this calculation, we can first calculate the row totals and then loop over the fertilizer types:

block_columns <- paste0('block', 1:8)
row_totals <- apply(grain[, block_columns], 1, sum)
fert_types <- unique(grain$fertilizer_type)
trt_totals <- rep(0, length(fert_types))
for(i in seq_along(fert_types)){
    condition <- grain$fertilizer_type == fert_types[i]
    trt_totals[i] <- sum(row_totals[condition])
}
names(trt_totals) <- fert_types

Notice that the “non-treatment” case also is captured.

It turns out the loop can be replaced with a single call to tapply():

block_columns <- paste0('block', 1:8)
row_totals <- apply(grain[, block_columns], 1, sum)
trt_totals <- tapply(row_totals, grain$fertilizer_type, sum)
trt_totals

Like apply(), tapply() applies a function to chunks of the data, but the chunks are defined by the values of a grouping variable rather than by rows or columns. It also carries the group names over to the output.

You can also get the totals by crossing multiple factors:
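
Here is a minimal sketch of the same idea on made-up numbers, small enough to check by hand. `toy_totals` and `toy_type` are hypothetical stand-ins for the row totals and the fertilizer_type column.

```r
# Made-up row totals and treatment labels (not Fisher's data)
toy_totals <- c(10, 20, 30, 40, 50)
toy_type   <- c("muriate", "sulphate", "muriate", "none", "sulphate")

# Split toy_totals by toy_type, then sum each group
toy_trt_totals <- tapply(toy_totals, toy_type, sum)
toy_trt_totals
```

One difference from the for-loop version: tapply() treats the grouping variable as a factor, so the groups come out in sorted level order rather than in order of first appearance.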

block_columns <- paste0('block', 1:8)
row_totals <- apply(grain[, block_columns], 1, sum)
trt_totals <- tapply(row_totals, list(grain$fertilizer_type,
                                      grain$timing), sum)
trt_totals
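
To see what shape the crossed result takes, the sketch below runs the same call on made-up data. The vectors and their labels ("early", "late", "none") are hypothetical and may not match the values in the real timing column.

```r
# Made-up totals with two grouping variables (not Fisher's data)
toy_totals <- c(10, 20, 30, 40, 50)
toy_type   <- c("muriate", "sulphate", "muriate", "sulphate", "none")
toy_timing <- c("early", "early", "late", "late", "none")

# Passing a list of factors gives one cell per combination of levels
toy_cross <- tapply(toy_totals, list(toy_type, toy_timing), sum)
toy_cross
dim(toy_cross)
```

The result is a matrix with one row per level of the first factor and one column per level of the second. Combinations that never occur in the data are filled with NA.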

Elements of tapply()

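If you look at the documentation for tapply(), you’ll see something like tapply(X, INDEX, FUN, ...), where X is the vector of values, INDEX is the grouping factor (or a list of factors), and FUN is the function to apply to each group.

Here’s a way to visualize how tapply() works:
[Figure: tapply visual]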

Exercise

Working on multiple columns using aggregate()

Notice how tapply() required us to calculate the row totals first, but we may want to perform the calculation on each block separately.

To accomplish this, we can use aggregate():

block_columns <- paste0('block', 1:8)
block_trt_totals <- aggregate(grain[, block_columns], grain['fertilizer_type'], sum)
block_trt_totals

aggregate() is very similar to tapply() except that the data argument can be a matrix or data frame with multiple columns instead of just a single vector.
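
A minimal sketch on a made-up data frame shows what comes back. `toy` and its values are hypothetical; only the column names mirror the ones used above.

```r
# Made-up data frame with a grouping column and two block columns
toy <- data.frame(
  fertilizer_type = c("muriate", "sulphate", "muriate", "sulphate"),
  block1 = c(1, 2, 3, 4),
  block2 = c(10, 20, 30, 40)
)

# Sum every block column separately within each fertilizer type
toy_agg <- aggregate(toy[, c("block1", "block2")],
                     toy["fertilizer_type"],
                     sum)
toy_agg
```

The output is a data frame with one row per group and one column per summarized variable, which makes it easy to feed into later steps that expect a data frame.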

You can also pass multiple factors. Notice how the output differs from that of tapply():

block_columns <- paste0('block', 1:8)
block_trt_totals <- aggregate(grain[, block_columns],
                              grain[c('fertilizer_type', 'timing')],
                              sum)
block_trt_totals
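
The contrast in output shape can be sketched side by side on made-up data. The `toy` data frame and its labels are hypothetical, not taken from the grain dataset.

```r
# Made-up data with two grouping columns and one block column
toy <- data.frame(
  fertilizer_type = c("muriate", "sulphate", "muriate", "sulphate", "none"),
  timing          = c("early", "early", "late", "late", "none"),
  block1          = c(1, 2, 3, 4, 5)
)

# aggregate(): long format, one row per combination that occurs in the data
long <- aggregate(toy["block1"], toy[c("fertilizer_type", "timing")], sum)

# tapply(): wide format, one cell per possible combination (NA if unobserved)
wide <- tapply(toy$block1, list(toy$fertilizer_type, toy$timing), sum)

long
wide
```

Here aggregate() lists only the combinations that appear in the data, while tapply() lays out the full grid of levels and marks the missing combinations with NA.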

Review