Wayne's Github Page

A place to learn about statistics

Scope, Debugging, and the Importance of Naming

Let’s say you wanted to validate all those calculations you did when doing hypothesis testing with the 2 sample t-test.

In the 2 sample t-test, you need to calculate a few things, such as averages and standard deviations, and then combine them. To create fake data, you’ll also need to specify some parameters. With this many quantities, it is very easy for repeated names to conflict with one another.

We will use this to demonstrate an important concept in programming called “scope” and also teach you how to debug. The code in this section will be intentionally written in an odd/incorrect way to highlight these concepts.

Lightning review of hypothesis testing

As a quick reminder, a hypothesis test is similar to a proof by contradiction in mathematics, but it uses data instead of logic. To disprove something, we assume it’s true, then collect some data. If the assumption leads us to some improbable result in the data, we then reject the assumption.

In the 2 sample t-test, the assumption is often “the 2 groups have the same distribution” and the result in the data is often the difference between the averages. Using our male vs female cops example from our previous lesson, the distribution can refer to the salaries within each of these two groups. Assuming the two groups have the same salary distribution implies they have the same average salary. This allows us to use the difference between the average salaries (of the groups) as a way to validate our assumption.
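To make the “assume it’s true, then look for an improbable result” idea concrete, here is a minimal simulation sketch. It uses R’s built-in t.test() on fake data where both groups really do share the same distribution. The names n_sims, group1, group2, and false_alarm_rate are made up for this illustration, and the 0.05 cutoff is just the usual convention.

```r
# Both groups are drawn from the same distribution, so the assumption is true.
set.seed(1)
n_sims <- 2000   # hypothetical number of repeated experiments

p_values <- replicate(n_sims, {
    group1 <- rnorm(100)
    group2 <- rnorm(25)
    t.test(group1, group2)$p.value
})

# Fraction of experiments where the data looked "improbable" anyway
false_alarm_rate <- mean(p_values < 0.05)
print(false_alarm_rate)
```

If the test behaves as advertised, the printed fraction should land near 0.05, meaning we only rarely reject an assumption that is actually true.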

Recalling the 2 sample t-test

To validate whether hypothesis testing “works”, you would likely replicate its calculation in code first. The calculation goes as follows:

avg1 <- mean(data1)
avg2 <- mean(data2)

# Getting standard errors
sd1 <- sd(data1)
sd2 <- sd(data2)
s1 <- length(data1)
s2 <- length(data2)
se1 <- sd1 / sqrt(s1 - 1)
se2 <- sd2 / sqrt(s2 - 1)

test_stat <- (avg1 - avg2) / sqrt(se1^2 + se2^2)

To fix the problems mentioned, we might revise the code as follows:

# Create fake data
s1 <- 100
s2 <- 25
set.seed(1)
data1 <- rnorm(s1)
data2 <- rnorm(s2)

avg1 <- mean(data1)
avg2 <- mean(data2)

# Getting standard errors
calc_se <- function(data1){
    # The naming of s1 here is intentionally in conflict with our previous convention
    s1 <- sd(data1)
    ss1 <- length(data1) # ss = sample size
    return(s1 / sqrt(ss1 - 1))
}
s2 <- calc_se(data2)
# Here is yet another conflict of s1
s1 <- calc_se(data1)

test_stat <- (avg1 - avg2) / sqrt(s1^2 + s2^2)
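One way to check a hand-written test statistic is to compare it against R’s built-in t.test(). The sketch below is only a sanity check under stated assumptions: the names x, y, se_x, se_y, manual_stat, and builtin_stat are made up for this illustration. Because sd() already divides by n - 1, the built-in Welch test takes each standard error as sd / sqrt(n). The code above divides by sqrt(n - 1) instead, so its value will be close to, but not exactly, the built-in one.

```r
set.seed(1)
x <- rnorm(100)
y <- rnorm(25)

# Standard error of each average, using sd / sqrt(n)
se_x <- sd(x) / sqrt(length(x))
se_y <- sd(y) / sqrt(length(y))
manual_stat <- (mean(x) - mean(y)) / sqrt(se_x^2 + se_y^2)

# t.test() uses the unequal-variance (Welch) version by default
builtin_stat <- unname(t.test(x, y)$statistic)

print(c(manual = manual_stat, builtin = builtin_stat))
print(all.equal(manual_stat, builtin_stat))
```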

The same variable name used at different places

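From the example above, notice the 3 locations where s1 is assigned: s1 <- 100 when creating the fake data, s1 <- sd(data1) inside calc_se(), and s1 <- calc_se(data1) after the function is defined.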

Recall that when we use the assignment operator, <-, the last assignment evaluated overwrites all the previous ones. But we have a twist because one of them is defined within a function. So in what order is each s1 <- actually evaluated?

More importantly, is the code with calc_se() equivalent to what we would get if we simply “copy/pasted” the body of the function outside? For example:

# Create fake data
s1 <- 100
s2 <- 25
set.seed(1)
data1 <- rnorm(s1)
data2 <- rnorm(s2)

avg1 <- mean(data1)
avg2 <- mean(data2)

# Getting standard errors
# The code below replaces `s2 <- calc_se(data2)`
data1 <- data2
s1 <- sd(data1)
ss1 <- length(data1) # ss = sample size
s2 <- s1 / sqrt(ss1 - 1)
# The code below replaces `s1 <- calc_se(data1)`
data1 <- data1
s1 <- sd(data1)
ss1 <- length(data1) # ss = sample size
s1 <- s1 / sqrt(ss1 - 1)

test_stat <- (avg1 - avg2) / sqrt(s1^2 + s2^2)

Debugging 101 - Adding print statements

Understanding the behavior of the code is the number 1 skill in debugging. To build that understanding, people often inject print() statements at different parts of the code to see the order in which it runs and the values it is working with.

In the case where we “copy/pasted” the body of the function out, we can simply add a print() statement before/after s1 is redefined.

# Create fake data
s1 <- 100
print(paste('s1 is now', s1))
s2 <- 25
set.seed(1)
data1 <- rnorm(s1)
data2 <- rnorm(s2)

avg1 <- mean(data1)
avg2 <- mean(data2)

# Getting standard errors
# The code below replaces `s2 <- calc_se(data2)`
data1 <- data2
s1 <- sd(data1)
print(paste('s1 is now', s1))
ss1 <- length(data1) # ss = sample size
s2 <- s1 / sqrt(ss1 - 1)
# The code below replaces `s1 <- calc_se(data1)`
data1 <- data1
s1 <- sd(data1)
print(paste('s1 is now', s1))
ss1 <- length(data1) # ss = sample size
s1 <- s1 / sqrt(ss1 - 1)
print(paste('s1 is now', s1))

test_stat <- (avg1 - avg2) / sqrt(s1^2 + s2^2)

The exercise above should be relatively straightforward since we know the last assignment to s1 will override all the previous versions.
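The “copy/pasted” version has one more side effect worth checking: data1 <- data2 overwrites the global data1, so both standard errors end up being computed from data2. Here is a small sketch comparing the two approaches. The names se1_fun, se2_fun, se1_paste, and se2_paste are made up for this illustration.

```r
set.seed(1)
data1 <- rnorm(100)
data2 <- rnorm(25)

# Function version: data1 and s1 inside calc_se() only exist during the call
calc_se <- function(data1){
    s1 <- sd(data1)
    ss1 <- length(data1) # ss = sample size
    return(s1 / sqrt(ss1 - 1))
}
se2_fun <- calc_se(data2)
se1_fun <- calc_se(data1)

# Copy/paste version: every assignment changes the global variables
data1 <- data2           # the original data1 is lost here
s1 <- sd(data1)
ss1 <- length(data1)
se2_paste <- s1 / sqrt(ss1 - 1)
data1 <- data1           # data1 still holds data2
s1 <- sd(data1)
ss1 <- length(data1)
se1_paste <- s1 / sqrt(ss1 - 1)

print(c(se1_fun = se1_fun, se1_paste = se1_paste))
print(identical(se1_paste, se2_paste)) # TRUE
print(length(data1))                   # 25
```

So the two versions are not equivalent: the function leaves the global data1 alone, while the copy/pasted code quietly replaces it.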

Now, let’s add these print statements, plus a few additional ones, to the version that uses our function calc_se(). Before running it, please guess the value each print() statement will show and the order in which the messages will appear:

# Create fake data
s1 <- 100
print(paste('s1 is now', s1))
s2 <- 25
set.seed(1)
data1 <- rnorm(s1)
data2 <- rnorm(s2)

avg1 <- mean(data1)
avg2 <- mean(data2)

# Getting standard errors
calc_se <- function(data1){
    # The naming of s1 here is intentionally in conflict with our previous convention
    print(paste('inside the function: s1 is now', s1))
    s1 <- sd(data1)
    print(paste('inside the function: s1 is now', s1))
    ss1 <- length(data1) # ss = sample size
    return(s1 / sqrt(ss1 - 1))
}
print(paste('after calc_se definition s1 is now', s1))
s2 <- calc_se(data2)
print(paste('between calc_se calls, s1 is now', s1))
# Here is yet another conflict of s1
s1 <- calc_se(data1)
print(paste('finally, s1 is now', s1))

test_stat <- (avg1 - avg2) / sqrt(s1^2 + s2^2)

Order and Scope

If the code above was run in order, you should have noticed 8 total messages from print(). We will explain each in order!

From the above explanation, there are 2 major concepts: the order of code execution and “scope”.
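Here is a minimal sketch of scope on its own, stripped of the t-test details. The function show_scope() and the names before and after are made up for this illustration. Inside a function, R first looks for a variable locally and, if it is not found, falls back to where the function was defined (here, the global environment). An assignment with <- inside the function creates a local variable and leaves the global one untouched.

```r
s1 <- 100                      # global s1

show_scope <- function(x){
    before <- s1               # no local s1 yet, so R finds the global one
    s1 <- sd(x)                # creates a new s1 that only lives in this call
    after <- s1                # now the local s1 is found first
    return(c(before = before, after = after))
}

result <- show_scope(c(1, 2, 3))
print(result)                  # before is 100, after is 1
print(s1)                      # still 100
```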

A graphical view of scope

To visualize the topic of scope, the following may help:

[Figure: Scope visualized]

Exercise:

Quick note on debugging and naming

In general, it’s not a bad practice to add print() statements at different parts of your code to make sure it behaves the way you expect.

Notice that this potential confusion could also be avoided if we named our variables better. Better naming can really decrease the number of bugs in your code!
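As one possible sketch of what “better naming” could look like, the version below gives every quantity its own name so nothing is reused for two purposes. The names n1, n2, sample1, sample2, sample_sd, sample_size, se1, and se2 are just suggestions, not the only reasonable choice. The standard error formula is kept the same as in calc_se() above.

```r
# Create fake data
set.seed(1)
n1 <- 100                  # sample sizes
n2 <- 25
sample1 <- rnorm(n1)
sample2 <- rnorm(n2)

# Getting standard errors
calc_se <- function(x){
    sample_sd <- sd(x)
    sample_size <- length(x)
    return(sample_sd / sqrt(sample_size - 1)) # same formula as above
}
se1 <- calc_se(sample1)
se2 <- calc_se(sample2)

test_stat <- (mean(sample1) - mean(sample2)) / sqrt(se1^2 + se2^2)
print(test_stat)
```

With these names, n1 is always a sample size and se1 is always a standard error, so there is no need to trace which assignment ran last.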