Wayne's Github Page

A place to learn about statistics

Data visualization

The next task we need to learn is how to plot the data. Data visualization is a powerful technique that often highlights useful patterns in the data. Specifically, we will plot the trajectories of corn yields for different states over the years.

To do this, we will need to learn about

Subsetting

Previously, we covered subsetting vectors using numeric vectors, e.g.

arbitrary_data <- 10:15
print(arbitrary_data[2])
print(arbitrary_data[2:4])

It turns out you can also subset using boolean vectors and character vectors. We show some examples below:

arbitrary_data <- 10:14  # five values, one for each of the five names
names(arbitrary_data) <- c("a", "b", "x", "y", "z")
print(arbitrary_data)

# Subsetting with character vectors
print(arbitrary_data['a'])
print(arbitrary_data[c('b', 'x', 'z')])

# Subsetting with boolean vectors
bool_vec <- c(FALSE, TRUE, TRUE, FALSE, TRUE)
print(arbitrary_data[bool_vec])

Things to notice:

Quick note on data types

It is important to know data types in programming because

What are character vectors?

Character vectors are composed of character strings like "a", "hello", c("statistical", "computing"), etc.

Character values can be used to change axis labels, create file names dynamically, or find keywords in job descriptions.

A common function we’ll use with characters is paste0(), which combines different character strings into one.

alphas <- c('a', 'b', 'c')
paste0('file_', alphas, '.csv')

You can also give names to the different elements of a vector:

num_tosses <- 5
coin_tosses <- sample(c(1, 0), num_tosses, replace=TRUE)
names(coin_tosses) <- paste0('toss', 1:num_tosses)
print(coin_tosses)
print(coin_tosses['toss5'])

Exercises

What are boolean vectors?

Boolean values are TRUE or FALSE values.

These are often created as the result of a logical statement like 1 < 2.

Boolean values can be used to identify outliers, filter data that belongs to a particular group (e.g. country), or control the flow of code (we’ll explain this later).

When using booleans to subset another vector, only the elements corresponding to the TRUE values will be kept:

nums <- 1:5
nums[c(FALSE, FALSE, TRUE, TRUE, TRUE)]

One important feature about boolean values is that they behave like 0 or 1 values when we perform arithmetic with them.

TRUE + FALSE
TRUE * FALSE
FALSE + FALSE
TRUE * 3
sum(c(TRUE, TRUE, FALSE))
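
Because TRUE counts as 1 and FALSE counts as 0, sum() and mean() give a quick way to count, or measure the share of, the elements that satisfy a condition. A minimal sketch, using a made-up vector named heights that is not part of the lesson data:

```r
# Hypothetical data for illustration only
heights <- c(150, 162, 171, 158, 190, 176)

is_tall <- heights > 170      # boolean vector, one value per element
n_tall <- sum(is_tall)        # number of TRUE values
prop_tall <- mean(is_tall)    # proportion of TRUE values

print(n_tall)
print(prop_tall)
```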

Exercises

Vectorized operations with vectors

It is common to apply an operation between a vector and a single value, for example, checking which numbers are larger than a certain value.

nums <- 1:5
larger_than_3 <- nums > 3
print(larger_than_3)

In the code above, nums is a vector of length 5. When compared to 3, a constant, the comparison was carried out between each element in nums and the value 3 as in the following code:

larger_than_3 <- c(1 > 3, 2 > 3, 3 > 3, 4 > 3, 5 > 3)
print(larger_than_3)

Applying one operation to every element of a vector in this way is called a vectorized operation. This happens with other operations too:

nums <- 1:5
print(nums - 3)
print(nums * -1)

Why bother with vectorized operations?
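
One reason is that vectorized code is shorter and usually faster than handling the elements one at a time. The sketch below computes the same comparison both ways; the for-loop is previewed here only for contrast, and the variable names are illustrative.

```r
nums <- 1:5

# Loop version: compare one element at a time
loop_result <- logical(length(nums))
for (i in seq_along(nums)) {
    loop_result[i] <- nums[i] > 3
}

# Vectorized version: one expression handles every element
vec_result <- nums > 3

print(loop_result)
print(vec_result)
print(identical(loop_result, vec_result))
```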

Exercises

Other Boolean Operators

The most common logical statements are:

Code   Operation          Example
!      negation           !TRUE
>      greater than       vec_demo <- 1:5; vec_demo > 2
==     equal              vec_demo <- c("A", "B", "B"); vec_demo == "A"
>=     greater or equal   vec_demo <- 1:5; vec_demo >= 2
<=     less or equal      vec_demo <- 1:5; vec_demo <= 2
!=     not equal          vec_demo <- c("A", "B", "B"); vec_demo != "A"
&      and                vec_demo <- 1:5; (vec_demo > 2) & (vec_demo >= 2)
|      or                 vec_demo <- 1:5; (vec_demo > 2) | (vec_demo >= 2)

For the & and | operation, it’s especially important to understand:
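
One point worth seeing in action (offered as an illustration, not necessarily the list originally intended here) is that & and | work element by element: & is TRUE only where both sides are TRUE, while | is TRUE where at least one side is TRUE.

```r
left  <- c(TRUE, TRUE, FALSE, FALSE)
right <- c(TRUE, FALSE, TRUE, FALSE)

and_result <- left & right   # TRUE only when both are TRUE
or_result  <- left | right   # TRUE when at least one is TRUE

print(and_result)
print(or_result)

# Typical use: keep the values that fall inside a range
vec_demo <- 1:5
in_range <- (vec_demo >= 2) & (vec_demo <= 4)
print(vec_demo[in_range])
```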

Exercises

Vectors again!

Recall that vectors are a collection of data with a type and a length. We mentioned before that all the data within a vector must belong to the same type. We show that again here:

num_demo <- 1
demo_vec <- c(num_demo, 'hello')
class(num_demo) == class(demo_vec[1])
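
To see what happens to mixed types, the sketch below checks the class of a few combined vectors. R silently converts every element to a single common type; the variable names are illustrative.

```r
mixed_chr <- c(1, 'hello')    # number and character -> character
mixed_num <- c(1, TRUE)       # number and boolean   -> numeric

print(class(mixed_chr))
print(mixed_chr)              # the 1 is now the string "1"
print(class(mixed_num))
print(mixed_num)              # TRUE is now the number 1
```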

Exercises

Special data types - missing values

Missing values are a very special type of data, and a few other data types are easily mistaken for them.

The missing value data type is NA in R. Properties of NA include:
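
A few behaviors of NA that are safe to rely on are sketched below with made-up values; the original list of properties may have covered others.

```r
scores <- c(90, NA, 75)

print(NA + 1)                     # arithmetic with NA gives NA
print(NA > 3)                     # comparisons with NA give NA
print(is.na(scores))              # is.na() is the way to detect NA
print(mean(scores))               # NA propagates through summaries
print(mean(scores, na.rm=TRUE))   # na.rm=TRUE drops the NA first
```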

What’s the point of missing values?

Missing values are great defaults because no data can be better than bad data.

For example
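
Suppose a student misses a quiz (the numbers below are made up). Recording the score as 0 quietly drags the average down, while recording it as NA makes the gap visible and forces us to decide how to handle it.

```r
# Hypothetical quiz scores; the third student did not take the quiz
coded_as_zero <- c(80, 90, 0, 70)
coded_as_na   <- c(80, 90, NA, 70)

print(mean(coded_as_zero))             # misleadingly low average
print(mean(coded_as_na))               # NA signals that something is missing
print(mean(coded_as_na, na.rm=TRUE))   # average of the observed scores only
```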

Next most common collective data type - Data Frames

Data frames are most often thought of as a collection of vectors, each of which can be a different type.

student_roster <- data.frame(
    student_id = 1:3,
    family_name = c("Doe", "Lee", "Liang"),
    given_name = c("John", "Billy", "Sally"),
    dropped = c(TRUE, FALSE, FALSE),
    stringsAsFactors=FALSE)
print(student_roster)
print(class(student_roster))
print(colnames(student_roster)) # colnames = column names
print(dim(student_roster)) # dim = dimension
print(length(student_roster))

Above, we create a data frame named student_roster with 4 different columns: 1 numeric (integer), 2 character, and 1 boolean. The argument stringsAsFactors=FALSE ensures that character values remain characters. Before R 4.0.0, the default was to convert them into factors, a different type of data we will introduce later; since R 4.0.0, the default is already FALSE.
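
To check the type of each column, one option is to apply class() to every column with sapply(). The sketch rebuilds the same student_roster so that it runs on its own.

```r
student_roster <- data.frame(
    student_id = 1:3,
    family_name = c("Doe", "Lee", "Liang"),
    given_name = c("John", "Billy", "Sally"),
    dropped = c(TRUE, FALSE, FALSE),
    stringsAsFactors=FALSE)

col_types <- sapply(student_roster, class)  # one class per column
print(col_types)
print(nrow(student_roster))  # number of rows
print(ncol(student_roster))  # number of columns, same as length()
```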

Some things to note:

Exercises

Subsetting different columns and rows

Similar to vectors, you can subset data frames using the [] operator but with some modifications.

As practice, try walking through the code below, line by line, and guess what each line does before reading the explanation!

student_roster <- data.frame(
    student_id = 1:3,
    family_name = c("Doe", "Lee", "Liang"),
    given_name = c("John", "Billy", "Sally"),
    dropped = c(TRUE, FALSE, FALSE),
    stringsAsFactors=FALSE)

# To get the 2nd column with integers
student_roster[, 2]
# To get the 2nd row with integers
student_roster[2, ]

# To subset by column with character vectors
student_roster[, c('family_name', 'given_name')]
# A common alternative if you only need one column is to use "$" followed
# by the column name without quotes
student_roster$family_name

# To subset using booleans, e.g. those who have NOT dropped the class
dropped_class <- student_roster[, "dropped"]
student_roster[!dropped_class, ]
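
Row and column selections can also be combined in a single call. A small sketch, with student_roster rebuilt so the block runs on its own and with illustrative variable names:

```r
student_roster <- data.frame(
    student_id = 1:3,
    family_name = c("Doe", "Lee", "Liang"),
    given_name = c("John", "Billy", "Sally"),
    dropped = c(TRUE, FALSE, FALSE),
    stringsAsFactors=FALSE)

# Names of the students who have NOT dropped the class
still_enrolled <- !student_roster$dropped
enrolled_names <- student_roster[still_enrolled, c('given_name', 'family_name')]
print(enrolled_names)

# Selecting a single column returns a vector, not a data frame
one_col <- student_roster[, 'family_name']
two_cols <- student_roster[, c('family_name', 'given_name')]
print(class(one_col))
print(class(two_cols))
```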

Things to notice:

Exercises

Reading data from existing files

The most common way to get a data frame is actually by reading in an existing file such as fisher_1927_grain.csv, which contains data from 1927 harvests under different treatments.

Download the fisher_1927_grain.csv file. The function to read in this type of data is read.csv().

df <- read.csv("~/Downloads/fisher_1927_grain.csv")

A few things to know:

Common errors when loading data

If you encountered an issue when you tried to read in the file, here are some common mistakes that beginners make:

Arguments in read.csv()

You should look at ?read.csv to see the arguments that change how the function reads your file. The function relies on the file being formatted in a particular way to tell apart different data values, numbers vs characters, column headings, etc. If the file is formatted differently, you can usually still read it by changing a few arguments.

The most popular arguments are:
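
The sketch below writes a tiny file to a temporary location and reads it back, so it runs without any download. The file contents are invented, and the arguments shown (header, sep, na.strings, stringsAsFactors) are a few commonly used ones rather than a definitive list.

```r
# Write a small, made-up file where ";" separates values and "-" marks missing data
tmp_file <- tempfile(fileext=".csv")
writeLines(c("plot;treatment;yield",
             "1;manure;29.5",
             "2;none;-",
             "3;manure;31.2"),
           tmp_file)

demo_df <- read.csv(tmp_file,
                    header=TRUE,             # first line holds the column names
                    sep=";",                 # values are separated by ";"
                    na.strings="-",          # treat "-" as a missing value
                    stringsAsFactors=FALSE)  # keep text as character
print(demo_df)
print(sapply(demo_df, class))
```

Without na.strings="-", the yield column would be read as character because "-" is not a number.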

Exploring Larger Data Frames

In general, it’s never good practice to “see all the data” as we do in Excel. With large datasets, I recommend using these functions:

These are usually sufficient for you to start plotting for better understanding of the data.
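
Functions commonly used for a first look include dim(), head(), tail(), str(), and summary(); they are offered here as reasonable candidates, not necessarily the exact list intended above. The sketch uses mtcars, a dataset that ships with R, so no file is needed.

```r
big_df <- mtcars   # built-in data frame with 32 rows and 11 columns

print(dim(big_df))           # number of rows and columns
print(head(big_df, 3))       # first 3 rows
print(tail(big_df, 3))       # last 3 rows
str(big_df)                  # type and first few values of each column
print(summary(big_df$mpg))   # min, quartiles, mean, and max of one column
```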

The dataset - corn yields

We will plot the corn yields over time. First, get the data by downloading the file: usda_i_state_corn_yields_cleaned.csv.

Follow the code below to get the recent (after year 2000) corn yields from Idaho.

df <- read.csv("usda_i_state_corn_yields_cleaned.csv")
states <- df$state_name
years <- df$year
is_idaho <- states == "IDAHO"
after_2000 <- years > 2000

idaho <- df[is_idaho & after_2000, ]
dim(idaho)

A quick summary of the code above is:
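
The same steps can be traced on a toy data frame with made-up numbers (the column names match the real file, the values do not): build one boolean vector per condition, combine them with &, and use the result to pick rows.

```r
# Hypothetical records for illustration only
toy_df <- data.frame(
    state_name = c("IDAHO", "IDAHO", "IOWA", "IDAHO"),
    year = c(1999, 2005, 2005, 2010),
    yield_bu_per_ac = c(100, 120, 150, 130),
    stringsAsFactors=FALSE)

is_idaho <- toy_df$state_name == "IDAHO"   # TRUE  TRUE FALSE TRUE
after_2000 <- toy_df$year > 2000           # FALSE TRUE TRUE  TRUE
keep <- is_idaho & after_2000              # FALSE TRUE FALSE TRUE

toy_idaho <- toy_df[keep, ]   # rows where both conditions hold, all columns
print(toy_idaho)
print(dim(toy_idaho))
```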

Plotting

Data visualization is a field in itself so we will only cover the basics for scatter plots, using the plot() function.

plot(idaho$year, idaho$yield_bu_per_ac)

The code above generates a scatter plot where each point’s x value is the year of the record and its y value is the yield_bu_per_ac of the record. Yield is the amount of corn produced (in units of bushels, shortened as bu) over the area required to produce it (in units of acres, shortened as ac). This should produce a plot like below:

[Figure: Idaho corn yields after 2000, without axis labels]

What to notice:

Exercises

Axis labels and titles

The first thing to modify about a plot is the axis labels and title.

plot(idaho$year, idaho$yield_bu_per_ac,
     xlab='Year', ylab='Yield (bu/ac)',
     main='Idaho Corn Yields have Increased Since 2000')

[Figure: Idaho corn yields after 2000, with axis labels and title]

What to notice:

Overlaying data with points()

A common operation is to compare data across different sources on the same plot. We will add the data from Illinois to our plot above using points()

is_illinois <- states == "ILLINOIS"
after_2000 <- years > 2000

illinois <- df[is_illinois & after_2000, ]

plot(idaho$year, idaho$yield_bu_per_ac,
     xlab='Year', ylab='Yield (bu/ac)',
     main='Idaho Corn Yields have Increased Since 2000',
     col="blue")
points(illinois$year, illinois$yield_bu_per_ac,
       col="red")

[Figure: Idaho and Illinois corn yields after 2000]

What to notice?

Exercises

Range of data

If you quickly check the range of the Illinois data, you’ll notice that its lowest value is much lower than the Idaho values. This suggests that the plot above is cutting off some of the data because its range is set by the Idaho data alone.

It turns out that plot() lets you set the range of each axis with the arguments xlim and ylim:

all_y_data <- c(idaho$yield_bu_per_ac, illinois$yield_bu_per_ac)
y_range <- range(all_y_data) # This is a vector of length 2

plot(idaho$year, idaho$yield_bu_per_ac,
     xlab='Year', ylab='Yield (bu/ac)',
     main='Idaho Corn Yields have Increased Since 2000',
     col="blue", ylim=y_range)
points(illinois$year, illinois$yield_bu_per_ac,
       col="red")

[Figure: plot with corrected range]

What to notice?

Starting with an empty plot

A common plot strategy is to start with an empty plot, then add points() from different sources.

x_range <- range(idaho$year)

plot(1, type="n",
     xlab='Year', ylab='Yield (bu/ac)',
     main='Idaho Corn Yields have Increased Since 2000',
     xlim=x_range, ylim=y_range)
points(idaho$year, idaho$yield_bu_per_ac,
       col="blue", pch=16)
points(illinois$year, illinois$yield_bu_per_ac,
       col="red", pch=15)

Notice that the calls to points() are slightly repetitive; later, we will replace them with a for-loop.

Breaking up the calls this way lets plot() handle the properties shared across the data, while points() handles the point locations, colors, and plotting characters for each group of the data. This division of responsibility between functions is a good way to think about structuring your code.

Legends

To label the different points, we will use a legend.

plot(1, type="n",
     xlab='Year', ylab='Yield (bu/ac)',
     main='ID/IL Corn Yields have Increased Since 2000',
     xlim=x_range, ylim=y_range)
points(idaho$year, idaho$yield_bu_per_ac,
       col="blue", pch=16)
points(illinois$year, illinois$yield_bu_per_ac,
       col="red", pch=15)
legend("bottomright", legend=c("Idaho", "Illinois"),
       col=c("blue", "red"), pch=c(16, 15))  # match the pch used in points()

[Figure: plot with legend]

Here we introduced several arguments within the legend() function

Changing the property for each point with vectors

Just like how each point can have its location specified using a vector, we can also use vectors to change each point’s color and plotting character.

For this example, we switch away from real data for clarity:

plot(1:20, 1:20, pch=1:20)

[Figure: pch possibilities, plotting characters 1 to 20]

Notice how each point has a different plotting character, from 1 to 20. The same can be done with colors. Instead of using the usual character strings, we’re going to use the rgb() function, which specifies the amounts of red, green, and blue in a color.

plot(1:5, 1:5, pch=16, col=c(rgb(0, 0, 0),
                             rgb(1, 0, 0),
                             rgb(0, 1, 0),
                             rgb(0, 0, 1),
                             rgb(0.9, 0.9, 0.9))
    )

[Figure: five points drawn in different colors]
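
rgb() does not draw anything by itself; it returns a character string that encodes the color, which is why its results can be stored in a vector like any other color name. A brief sketch with illustrative variable names:

```r
pure_red <- rgb(1, 0, 0)   # full red, no green, no blue
print(pure_red)
print(class(pure_red))

# Each argument can be a vector, giving one color per position
reds <- rgb(c(0.2, 0.6, 1), c(0, 0, 0), c(0, 0, 0))
print(reds)
print(length(reds))
```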

And you can change both at the same time:

plot(1:5, 1:5, pch=c(1, 15, 16, 3, 17),
     col=c(rgb(0, 0, 0), rgb(1, 0, 0),
           rgb(0, 1, 0), rgb(0, 0, 1),
           rgb(0.9, 0.9, 0.9))
    )

[Figure: five points with different plotting characters and colors]

Plotting different states with different colors: for-loops

Imagine carrying out the example above for every state in our original dataset of historical USDA corn yields. The example above was slightly tedious to type out, so we will use a for-loop to handle the different states.

df <- read.csv("usda_i_state_corn_yields_cleaned.csv")
states <- df$state_name
years <- df$year
uniq_states <- unique(states)
colors <- c("red", "blue", "black", "purple")

x_range <- range(years)
y_range <- range(df$yield_bu_per_ac)
plot(1, type="n",
     xlab='Year', ylab='Yield (bu/ac)',
     main='Corn Yields Increased across all i-States',
     xlim=x_range, ylim=y_range)
for(i in seq_along(uniq_states)){
    is_target_state <- states == uniq_states[i]
    # Subset only the records that correspond to the state of interest
    sub_df <- df[is_target_state, ]
    points(sub_df$year, sub_df$yield_bu_per_ac,
           col=colors[i], pch=16)
}
legend('bottomright', legend=uniq_states,
       col=colors, pch=16)

Factors

There’s another strategy that can plot quickly for different subgroups using a special data type called factors.

To give you an idea about the properties of factors:

char_demo <- c("red", "yellow", "yellow", "green", "red")
fac_demo <- as.factor(char_demo)
class(fac_demo)
# [1] "factor"

# Property 1, factors have levels
levels(fac_demo)
# [1] "green"  "red"    "yellow"

# Factors can be turned into numbers or characters
print(fac_demo)
print(as.numeric(fac_demo))
print(as.character(fac_demo))

Notice how the numbers from as.numeric correspond to the order of the output from levels().

Levels are a unique attribute of factors!

A very special behavior in R is that if we try to subset with factors, it’s like subsetting with numeric values (in particular, the value corresponding to the different levels)!

levels(fac_demo)
# Notice the actions are intentionally chosen
# to be in order of the levels!
traffic_actions <- c('go', 'stop', 'yield')
print(traffic_actions[fac_demo])
print(fac_demo)

You can imagine that what happened in the subsetting above is that the factors were turned into numbers (according to the order of their level), then they were used to subset the vector.
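
This mental model can be checked directly. The sketch below repeats the definitions so that it runs on its own, then compares subsetting with the factor against subsetting with its numeric codes; via_factor and via_number are illustrative names.

```r
char_demo <- c("red", "yellow", "yellow", "green", "red")
fac_demo <- as.factor(char_demo)
traffic_actions <- c('go', 'stop', 'yield')

via_factor <- traffic_actions[fac_demo]
via_number <- traffic_actions[as.numeric(fac_demo)]

print(via_factor)
print(identical(via_factor, via_number))

# Subsetting with the character version looks for element NAMES instead,
# and traffic_actions has none, so every result is NA
print(traffic_actions[char_demo])
```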

What’s new is that, until now, we have not selected the same element multiple times in one subset. Here, we will use this behavior to assign different colors to the points according to a factor.

Exercises

Corn trajectories

Below we re-write the code from the for-loop above using factors.

df <- read.csv("usda_i_state_corn_yields_cleaned.csv")
# Since R 4.0.0, read.csv() keeps text as character, so convert explicitly
states <- as.factor(df$state_name)
print(class(states))
colors <- c("red", "blue", "black", "purple")
plot(df$year, df$yield_bu_per_ac,
     pch=16,
     xlab='Year', ylab='Yield (bu/ac)',
     main='Corn Yields Increased across all i-States',
     col=colors[states])
legend('bottomright', legend=levels(states),
       col=colors, pch=16)

[Figure: plot using factors]

What to notice:

Saving plots with png()

If you want to save an image file using code, you can place the plotting code between a png("FILE_NAME.png") call and a dev.off() call.

Warning: the code below will create 3 “.png” files in your working directory!

colors <- c('red', 'blue', 'black')
for(i in seq_along(colors)){
    file_name <- paste0('test_plot_', colors[i], '.png')
    png(file_name)
    plot(1:4, 1:4, pch=1:4, col=colors[i])
    dev.off()
}
list.files()

You do not need a for-loop to save plots, but this example shows how to create similar plots over different iterations.

Note that a common mistake is forgetting the dev.off() call; until the device is closed, the image file may be incomplete.

Exercise

Review