Wayne's Github Page

A place to learn about statistics

How to simulate Law of Large Numbers?

Setting

The law of large numbers (LLN) is one of the corner stone concepts in intro statistics that explains why larger samples are better.

Specifically, LLN says that sample averages based on a larger sample size will be closer to the population average than sample averages base on a smaller sample size.

To demonstrate that, you’ll need to learn how to:

The following steps should lead you to that final step.

Using programming as a calculator

I’m assuming you’ve used a calculator before. Try out a few of the following calculations in the Jupyter Notebook. To run the code, highlight the cell, then hit “Shift+Enter” or there’s a “Run |>” button in the tab area as well.

Exercises

Intentinoally making mistakes

Please make mistakes early on!

I highly encourage you to make mistakes. The more you do so, and learn, the better at coding you will become. In general, when you are intentionally trying to break the code, it is important that you

For example: If I were to run the following code:

1  1

Exercises

Calculator-like functions

Just like your algebra class where \(y=f(x)\), R has functions.

Similar to our mathematical notation, the most common method to refer to functions are used by typing out their “name” followed by the parentheses (), e.g. log(1)

The simplest example are functions that manipulate a single number.

Notice how the third example does not follow the convention for calling functions yet its similarity to the mathematical notation makes it easy to understand.

Exercises

Variables in programming

In R, you can assign values to variables so you can refer to them later and increase readability of your code (very important!).

To do assign a value to a variable name, you can use the <- or = symbol with the variable name on the left and the value on the right.

This allows you to refer to them later when you have repeated use of the same variable (e.g. think about the sample size n in intro statistics that appears in the mean and sample variance calculation).

Later, if you forget what’s assigned to the variable, usually you can use the print() function or simply type in the variable name at the end of the cell to get the value returned.

print(demo_num)
demo.char

Exercises

Overwriting variables

What if I want to update a variable? You simply assign a new value to it.

I recommend running these 3 chunks separately but in order:

my_name <- 'Wayne Lee'
print(my_name)

my_name <- 'Wayne Tai Lee'
print(my_name)

my_name <- 0
print(my_name)

A few things to note:

What is a function?

A function in programming is similar to a function in mathematics \(f(x)=y\). Given inputs, it performs some operations and (usually) returns an output.

We’ve introduced calculator-like functions before like log() and sin() but functions can be more complex:

Creating data with functions

Besides the usual calculator-like functions, the most basic functions are those that create a collection of values, this is how we will represent “data” in programming.

For example, to make a dice with the 6 sides or a coin with the 2 sides, we can use the c() function to combine the different values into one variable. This type of data is called a vector.

dice <- c(1, 2, 3, 4, 5, 6)
coin <- c("Head", "Tail")
empty_vec <- c()
print(dice)
print(coin)
print(empty_vec)

A few things to note:

This is not the only way to create vectors. There are two other common alternatives:

Exercises

Why do we care about vectors in statistical computing?

In statistics, we often talk about a sample of data, \(y_1, \dots, y_n\) where \(n\) is the sample size.

In statistical computing, the corresponding equivalent is a numeric vector of length n where the first element in the vector is \(y_1\), second element is \(y_2\), etc.

The most basic random variable is the outcome of \(n\) coin tosses where heads correspond to 1 and tails correspond to 0.

To generate this kind of data, you could use the following commands:

n <- 20
coin <- c(0, 1)
coin_tosses <- sample(coin, n, replace=TRUE)
print(coin_tosses)

In the code above:

Properties of vectors

It’s important to know the properties of vectors because these properties limit how different functions can interact with them.

Before continuing below, first define a variable called dice <- c(1, 3, 5, 2, 4, 6)

Exercises

Biggest confusion in R

The most confusing thing about R is that a single value (e.g. a single number) is a vector of length 1.

num_vec <- 1.96
num_vec1 <- c(1.96)
num_vec == num_vec1
class(num_vec) == class(num_vec1)
print(num_vec[1])
print(num_vec1[1])

If you come from a classic programming background, num_vec is a single number (scalar), where num_vec1 should be a vector with single element, the element being a scalar.

The analogy is similar to an individual (num_vec) vs a team with only one member (num_vec1). These are 2 conceptually different things.

However, in R, given its focus on data, the most basic element is a vector which can cause some unintuitive behaviors sometimes.

Functions on vectors

A popular operation we perform on data is to take the average. To do this in R, we can use the mean() function

n <- 20
coin <- c(0, 1)
coin_tosses <- sample(coin, n, replace=TRUE)
mean(coin_tosses)

A few things to note:

Elements of functions (big picture)

We will use our previous example with sample() to elaborate on the elements of a function.

coin <- c(0, 1)
coin_tosses <- sample(coin, 10, replace=TRUE)
coin_tosses

sample() here is a function that is tossing the coin 10 times, it has a few components that you should be aware of

Exercise

Inputs/arguments to functions

Same example as before:

n <- 20
coin <- c(0, 1)
coin_tosses <- sample(coin, n, replace=TRUE)

If you want to look into the documentation, run the code ?sample. You should notice the order and names of the inputs/arguments:

sample(x, size, replace = FALSE, prob = NULL)

How to read documentation in R

Documentation is often not well written but here are a few tips:

More detailed explanations on sample()

Again from the documentation: sample(x, size, replace = FALSE, prob = NULL)

In general, only very popular functions will have good documentation and explanations. You should get comfortable “testing” functions to see what will happen vs not.

Avoiding repetitive tasks: for-loops

In statistics, we often talk about probability as something happening over repeated trials. To emulate that, we can use for-loops.

First, run a simple for-loop example in R:

values_to_loop_over <- c(1, 5, 8)
for(i in values_to_loop_over){
    print(i)
}

At this level, it’s fine to think about the for-loop above as condensing this code:

values_to_loop_over <- c(1, 5, 8)
i <- values_to_loop_over[1]
print(i)
i <- values_to_loop_over[2]
print(i)
i <- values_to_loop_over[3]
print(i)

Notice how the repeated code is print(i). This repeated piece will often become the body of the for-loop. On the otherhand, i is simply taking on the next value within values_to_loop_over for each loop/iteration. This slight change is controlled by whatever you provide in values_to_loop_over.

Exercise

Elements of the for-loop

values_to_loop_over <- c(1, 5, 8)
for(i in values_to_loop_over){
    print(i)
}

The above loop has a few elements to consider:

Common for-loops mistake - overwriting your variables

For example, if we wanted to simulate 3 coin tosses but you were restricted to only using sample(c(0, 1), 1), i.e. your function could only toss one coin at a time.

A natural instinct is to write a loop like the following:

arbitrary_vec <- 1:3
for(i in arbitrary_vec){
    coin_toss <- sample(c(0, 1), 1)
}

The issue with this loop is that coin_toss is over-written from loop to loop because you can re-imagine the code as:

arbitrary_vec <- 1:3
i <- arbitrary_vec[1]
coin_toss <- sample(c(0, 1), 1)
i <- arbitrary_vec[2]
coin_toss <- sample(c(0, 1), 1)
i <- arbitrary_vec[3]
coin_toss <- sample(c(0, 1), 1)

Collecting outputs over each loop

The issue above was that any variable defined within the loop will get overwritten in each loop.

There are 2 strategies to overcome this issue (recall our goal is to create 3 coin tosses using a for-loop).

Below are examples for the 2 strategies using the same example before. At the end of the loop, notice how a variable coin_tosses will be created that will contain the results from the coin_tosses across each loop.

First strategy:

arbitrary_vec <- 1:3
coin_tosses <- c()
for(i in arbitrary_vec){
    coin_toss <- sample(c(0, 1), 1)
    coin_tosses <- c(coin_tosses, coin_toss)
}

Second strategy:

sequential_integers <- 1:3
coin_tosses <- c()
for(i in sequential_integers){
    coin_toss <- sample(c(0, 1), 1)
    coin_tosses[i] <- coin_toss
}

Exercises

Special note on c()

Above, we used the fact that given two vectors, c() will combine the inputs into a single vector.

This is logically consistent with how c(1, 2, 6) behaves because each number is also considered a vector of length 1.

Given this, you need to think about c() as a function that combines multiple vectors into a single vector.

First Statistical Simulation!

We’ve learned:

Now we can run simulations for the Law of Large Numbers!

The Law of Large Numbers says that sample averages with larger sample sizes will be “closer” to the population average. Instead of the standard error, however, we will use the mean absolute error for simplicity.

To demonstrate this, we will use the following template. I encourage you to

# Creating the unknown population mean
fair_coin <- c(0, 1)
biased_coin <- sample(fair_coin, 7, replace=TRUE)
population_avg <- mean(biased_coin)

# We will change the sample size to see the impact on the final result
sample_size <- 10

# The loop contains the repeated process of calculating different samples,
# calculating their corresponding sample averages, then calculating the
# the absolute error from the true population average.
abs_errors <- c()
num_simulation <- 1000
for(i in 1:num_simulation){
    sim_sample <- sample(biased_coin, sample_size, replace=TRUE)
    sample_avg <- mean(sim_sample)
    abs_error <- abs(sample_avg - population_avg)
    abs_errors[i] <- abs_error
}

# Finall, to evaluate "closeness" we will simply take the average
# across all absolute errors, a smaller value here would indicate
# "closer"
mean(abs_errors)

Comment on the code: the code above is intentionally written in a very verbose fashion.

Exercise

Best practices on writing for-loops for beginners

In general, when you are about to write a for-loop, the last thing you want to write is the for(){} statement.

Given the law of large number example, the first things that should be written is the final result, the mean absolute error.

mean_abs_error <- mean(abs_errors)

This line won’t run because abs_errors is not defined yet. abs_errors, however, is a collection of different values of abs_error, otherwise the average operation wouldn’t make sense.

abs_errors <- c()
i <- 1

abs_errors[i] <- abs_error
mean_abs_error <- mean(abs_errors)

Following the same logic, abs_error has not been defined yet.

abs_errors <- c()
i <- 1

abs_error <- abs(sample_avg - population_avg)
abs_errors[i] <- abs_error
mean_abs_error <- mean(abs_errors)

Now sample_avg and population_avg are not defined. But both of these require us to define a population. This would normally be provided to you or defined by the problem context:

fair_coin <- c(0, 1)
biased_coin <- sample(fair_coin, 7, replace=TRUE)
population_avg <- mean(biased_coin)

sample_size <- 10

abs_errors <- c()
i <- 1

sample_data <- sample(biased_coin, sample_size, replace=TRUE)
sample_avg <- mean(sample_data)
abs_error <- abs(sample_avg - population_avg)
abs_errors[i] <- abs_error
mean_abs_error <- mean(abs_errors)

Now we need to repeat the calculation with a for-loop. This requires us to replace the i=1 since that will be defined by the for-loop instead.

fair_coin <- c(0, 1)
biased_coin <- sample(fair_coin, 7, replace=TRUE)
population_avg <- mean(biased_coin)

sample_size <- 10

abs_errors <- c()
for(i in 1:1000){
    sample_data <- sample(biased_coin, sample_size, replace=TRUE)
    sample_avg <- mean(sample_data)
    abs_error <- abs(sample_avg - population_avg)
    abs_errors[i] <- abs_error
}
mean_abs_error <- mean(abs_errors)

By fixing the naming afterwards, the code is finished!

Review

In this section, we have learned about