How to simulate Law of Large Numbers?
Setting
The law of large numbers (LLN) is one of the corner stone concepts in intro statistics that explains why larger samples are better.
Specifically, LLN says that sample averages based on a larger sample size will be closer to the population average than sample averages base on a smaller sample size.
To demonstrate that, you’ll need to learn how to:
- Create a collection of data
- vectors
- random data generation
- Assign the data to variables
- Apply the average function over the data
- functions, defaults, different input types
- Simulate the sample average multiple times
- for-loops
- Collect the outcome from each simulation
- for-loops
The following steps should lead you to that final step.
Using programming as a calculator
I’m assuming you’ve used a calculator before. Try out a few of the following calculations in the Jupyter Notebook. To run the code, highlight the cell, then hit “Shift+Enter” or there’s a “Run |>” button in the tab area as well.
1
2*3
3 +4
3 + 4 * -1
(3 + 4) * -1
3 + 4 * - 1
log(1)
Intentinoally making mistakes
Please make mistakes early on!
I highly encourage you to make mistakes. The more you do so, and learn, the better at coding you will become. In general, when you are intentionally trying to break the code, it is important that you
- read the code
- anticipate the error/warning (or lack of error/warning)
- run the code
- read the error message or lack of error message
- make sense of the error message
For example: If I were to run the following code:
1 1
- read the code
- I see two
1
s are separated by a few white spaces
- I see two
- anticipate the error/warning (or lack of error/warning)
- I expect R to not return a warning and just print out two 1’s separated by the same number of white spaces
- run the code
- copy/paste the
1 1
into the Notebook and run it
- copy/paste the
- read the error message
Error: unexpected numeric constant in "1 1"
- make sense of the error message
- This is not obvious! Ask the instructor/TA for an explanation.
- Instructor’s attempt at explaining:
When R is interpreting that command, the error is referring to the second 1 being unexpected. After typing in the first 1 followed by spaces, R is anticipating you to operate with it like + , *, /, etc. However, it encounters the second 1 instead of what it was anticipating. R decided it could not interpret the intent of the command so it throws an error and does not run the command. You can verify by running
1 "a"
and see how the error message changes.
Calculator-like functions
Just like your algebra class where \(y=f(x)\), R has functions.
Similar to our mathematical notation, the most common method
to refer to functions are used by typing out their “name” followed
by the parentheses ()
, e.g. log(1)
The simplest example are functions that manipulate a single number.
log(1)
sin( 90 / 180 * pi)
2^3
Notice how the third example does not follow the convention for calling functions yet its similarity to the mathematical notation makes it easy to understand.
Variables in programming
In R, you can assign values to variables so you can refer to them later and increase readability of your code (very important!).
To do assign a value to a variable name, you can use the <-
or =
symbol with the variable name on the left and the value on the right.
-
demo_num <- 2 print(demo_num * 3) print(demo_num - 4)
demo.char <- "hello"
-
demo_num2 = -0.2 abs(demo_num2)
This allows you to refer to them later when you have repeated use of the same variable (e.g. think about the sample size n in intro statistics that appears in the mean and sample variance calculation).
Later, if you forget what’s assigned to the variable, usually you can use the print()
function or simply type in the variable name at the end of the cell to get the value returned.
print(demo_num)
demo.char
Overwriting variables
What if I want to update a variable? You simply assign a new value to it.
I recommend running these 3 chunks separately but in order:
my_name <- 'Wayne Lee'
print(my_name)
my_name <- 'Wayne Tai Lee'
print(my_name)
my_name <- 0
print(my_name)
A few things to note:
- The last assignment will dictate the value in
my_name
(try playing with the order and find out) - We were able to assign numbers to something that was a character string before
- One of the biggest source of error in beginner coders is that you forget what order your code was ran in and you recycle the same variable name which causes issues in the code later.
- Good naming is considered one of the hardest tasks in programming :)
What is a function?
A function in programming is similar to a function in mathematics \(f(x)=y\). Given inputs, it performs some operations and (usually) returns an output.
We’ve introduced calculator-like functions before like log()
and sin()
but functions can be more complex:
- The inputs can be more complex than a single number (e.g. a matrix!)
- They can take in multiple inputs
- Functions can have defaults, e.g. if you do not specify a significance level, we often default to 5%.
- Outputs can be more complex than a single number as well
Creating data with functions
Besides the usual calculator-like functions, the most basic functions are those that create a collection of values, this is how we will represent “data” in programming.
For example, to make a dice with the 6 sides or a coin with the 2 sides,
we can use the c()
function to combine the different values into
one variable. This type of data is called a vector.
dice <- c(1, 2, 3, 4, 5, 6)
coin <- c("Head", "Tail")
empty_vec <- c()
print(dice)
print(coin)
print(empty_vec)
A few things to note:
- The inputs can be numbers or characters
- The inputs are separated by
,
- There are multiple inputs
c
is the function name- Functions that create data tend to be able to take in arbitrary number of inputs
This is not the only way to create vectors. There are two other common alternatives:
- Creating a sequence of numbers, incremented by
1
using the:
example1:4
- Creating a vector by repeating another vector.
rep(1, 5)
Why do we care about vectors in statistical computing?
In statistics, we often talk about a sample of data, \(y_1, \dots, y_n\) where \(n\) is the sample size.
In statistical computing, the corresponding equivalent is a numeric vector of length n where the first element in the vector is \(y_1\), second element is \(y_2\), etc.
The most basic random variable is the outcome of \(n\) coin tosses where heads correspond to 1 and tails correspond to 0.
To generate this kind of data, you could use the following commands:
n <- 20
coin <- c(0, 1)
coin_tosses <- sample(coin, n, replace=TRUE)
print(coin_tosses)
In the code above:
- We’ve assigned the value 20 to the variable
n
- We’ve created a vector named
coin
, with 2 values,0
and1
. - We’ve then used the function
sample()
to generaten
random numbers (each 0 or 1). (Not important yet: thereplace=TRUE
statement just means that each random number is drawn fromcoin
with replacement so we can have a sample size that is larger than the total possible numbers.)- We will talk more about this function later.
- The key here is to know that vectors are the most basic form of data!
Properties of vectors
It’s important to know the properties of vectors because these properties limit how different functions can interact with them.
Before continuing below, first define a variable called dice <- c(1, 3, 5, 2, 4, 6)
- The number of elements within a vector is called its length
- Try running the code:
length(dice)
- Try running the code:
- The type of the elements within a vector is the vector’s type (implication of
this statement is that elements within a vector must all share the same type).
class(dice)
- You can subset different elements within the vector using
[]
(we will cover the idea of subsetting later)dice[3]
grabs the 3rd element within the vectordice
dice[c(2, 3)]
grabs the 2nd and 3rd element within the vector,dice
- You can change elements within a vector
-
dice[1] <- 6 print(dice)
-
Biggest confusion in R
The most confusing thing about R is that a single value (e.g. a single number) is a vector of length 1.
num_vec <- 1.96
num_vec1 <- c(1.96)
num_vec == num_vec1
class(num_vec) == class(num_vec1)
print(num_vec[1])
print(num_vec1[1])
If you come from a classic programming background, num_vec
is a single number (scalar),
where num_vec1
should be a vector with single element, the element being a scalar.
The analogy is similar to an individual (num_vec
) vs a team with only one member (num_vec1
).
These are 2 conceptually different things.
However, in R, given its focus on data, the most basic element is a vector which can cause some unintuitive behaviors sometimes.
Functions on vectors
A popular operation we perform on data is to take the average.
To do this in R, we can use the mean()
function
n <- 20
coin <- c(0, 1)
coin_tosses <- sample(coin, n, replace=TRUE)
mean(coin_tosses)
A few things to note:
- Given each value is a 0 or 1, the average is also the fraction of 1’s in the vector.
coin_tosses
is a collection of values that are all passed as one input into the functionmean()
. This is different from howc()
took in multiple inputs that were separated by,
.
Elements of functions (big picture)
We will use our previous example with sample()
to elaborate on the elements of a function.
coin <- c(0, 1)
coin_tosses <- sample(coin, 10, replace=TRUE)
coin_tosses
sample()
here is a function that is tossing the coin 10 times, it has a few components that you should be aware of
- Function name:
sample
- Inputs/arguments: all inputs are separated by
,
within the()
. These inputs to the functionsample
often have names to help you understand their purpose. - How inputs are passed to the function: this is done via the
()
and the different inputs are separated by,
.- The first 2 values are passed in “by order” where the third value is passed in “by name” (they can all be passed in by name or order).
The function has a default order of how inputs are entered which is why the first 2 inputs do not need to be given a name explicitly.
To know the order or the names, you would need to read the documentation for the function by running the code
?sample
in R.
- The first 2 values are passed in “by order” where the third value is passed in “by name” (they can all be passed in by name or order).
The function has a default order of how inputs are entered which is why the first 2 inputs do not need to be given a name explicitly.
To know the order or the names, you would need to read the documentation for the function by running the code
- The consequence and/or output of the function:
- In the example above, an output is generated and assigned into
coin_tosses
, by examiningcoin_tosses
, you’ll see that the output is a numeric vector of 0 and 1 values. - Some functions do not produce “output” but have a consequence on the environment. For example, deleting a variable, changing your working directory (we will explain this more later), etc.
- In the example above, an output is generated and assigned into
Inputs/arguments to functions
Same example as before:
n <- 20
coin <- c(0, 1)
coin_tosses <- sample(coin, n, replace=TRUE)
If you want to look into the documentation, run the code ?sample
.
You should notice the order and names of the inputs/arguments:
sample(x, size, replace = FALSE, prob = NULL)
- Since multiple inputs exist, you can pass inputs into a function by name or by order.
If we want to change the order of inputs for the code above, we can simply pass the
inputs in by name
coin_tosses <- sample(size=n, x=coin, replace=TRUE)
- Inputs with an
=
within the documentation often mean they have default values. In other words, if we do not specify those inputs, the defaults will be used.dice <- 1:6 # Notice that replace=FALSE guaratees no repeats! sample(dice, size=4)
How to read documentation in R
Documentation is often not well written but here are a few tips:
- Pay attention to the variable names. Most functions are written so the meaning of the variables can help you understand the function.
- Look at examples: at the bottom of the R documentaiton often has examples. Pay attention to what what is entered (vector of what type?), what is outputed. Most functions are meant to be simple so a few examples can often help you decipher different functions.
More detailed explanations on sample()
Again from the documentation:
sample(x, size, replace = FALSE, prob = NULL)
x
: here is either a single number OR a vector with length greater than 1 (notice this vector can contain numbers or other values), the function will behave differently in these 2 cases.size
: this is the sample size which will determine the length of the output (i.e.coin_tosses
will be lengthn
in this example)replace
: this can only take on 2 possible values:TRUE
orFALSE
. IfTRUE
, the sampling from the vectorx
will be done with replacement whereFALSE
means the sampling is done without placement. The implication is that ifreplace=FALSE
thensize
cannot be larger than the length ofx
prob
: prob is vector that indicates the probability for each element inx
(in order) to be picked. Ifprob
is not specified, each element withinx
will have the same probability of being picked.
In general, only very popular functions will have good documentation and explanations. You should get comfortable “testing” functions to see what will happen vs not.
Avoiding repetitive tasks: for-loops
In statistics, we often talk about probability as something happening over repeated trials. To emulate that, we can use for-loops.
First, run a simple for-loop example in R:
values_to_loop_over <- c(1, 5, 8)
for(i in values_to_loop_over){
print(i)
}
At this level, it’s fine to think about the for-loop above as condensing this code:
values_to_loop_over <- c(1, 5, 8)
i <- values_to_loop_over[1]
print(i)
i <- values_to_loop_over[2]
print(i)
i <- values_to_loop_over[3]
print(i)
Notice how the repeated code is print(i)
. This repeated piece will often
become the body of the for-loop. On the otherhand, i
is simply taking on
the next value within values_to_loop_over
for each loop/iteration. This
slight change is controlled by whatever you provide in values_to_loop_over
.
Elements of the for-loop
values_to_loop_over <- c(1, 5, 8)
for(i in values_to_loop_over){
print(i)
}
The above loop has a few elements to consider:
- First, the “body” of the loop are enclosed in
{}
- The
i
will be a the varaible that changes from loop to loop - The
values_to_loop_over
will control which valuesi
can take from loop to loop. for(? in ?){???}
is the basic structure that clarifies the different roles within the code.
Common for-loops mistake - overwriting your variables
For example, if we wanted to simulate 3 coin tosses but
you were restricted to only using sample(c(0, 1), 1)
, i.e.
your function could only toss one coin at a time.
A natural instinct is to write a loop like the following:
arbitrary_vec <- 1:3
for(i in arbitrary_vec){
coin_toss <- sample(c(0, 1), 1)
}
The issue with this loop is that coin_toss
is over-written from
loop to loop because you can re-imagine the code as:
arbitrary_vec <- 1:3
i <- arbitrary_vec[1]
coin_toss <- sample(c(0, 1), 1)
i <- arbitrary_vec[2]
coin_toss <- sample(c(0, 1), 1)
i <- arbitrary_vec[3]
coin_toss <- sample(c(0, 1), 1)
- Notice that
coin_toss
will only take on the final result forsample(c(0, 1), 1)
and the 2 previous coin_toss values are essentially discarded. - Notice also that
i
in this case does not serve any purpose but is simply a side effect from the for-loop.
Collecting outputs over each loop
The issue above was that any variable defined within the loop will get overwritten in each loop.
There are 2 strategies to overcome this issue (recall our goal is to create 3 coin tosses using a for-loop).
- When overwriting a variable, the variable will do so by including the output from the loop AND the variable itself.
- Update a variable that was defined outside the loop.
Below are examples for the 2 strategies using the same example before.
At the end of the loop, notice how a variable coin_tosses
will be
created that will contain the results from the coin_tosses across
each loop.
First strategy:
arbitrary_vec <- 1:3
coin_tosses <- c()
for(i in arbitrary_vec){
coin_toss <- sample(c(0, 1), 1)
coin_tosses <- c(coin_tosses, coin_toss)
}
- Notice in this case, we are leveraging
c()
to combine the results from previouscoin_tosses
(notice plural!) with the newestcoin_toss
- Notice how
coin_tosses
is being overwritten in each loop - Notice that we needed to define
coin_tosses
before the loop - Notice how
i
still serves no purpose here!
Second strategy:
sequential_integers <- 1:3
coin_tosses <- c()
for(i in sequential_integers){
coin_toss <- sample(c(0, 1), 1)
coin_tosses[i] <- coin_toss
}
- Notice in this case, we are leveraging the “subset” operation
to update
coin_tosses
in each loop. - Notice we needed to define
coin_tosses
before the loop! - Notice that, in R, you can subset for an index that is longer than the length of the vector.
This is observable from the fact that
coin_tosses
start with length 0 then we update the first element by assigning the output fromcoin_toss
to the empty vector. Subsequence updates are also subsetting indices longer than the vector to extend the vector. - To do this, we had to make sure that
i
is no longer looping over arbitrary values but should be sequential integers to update different values withincoin_tosses
- Notice how
i
now serves a purpose!
Special note on c()
Above, we used the fact that given two vectors, c()
will
combine the inputs into a single vector.
This is logically consistent with how c(1, 2, 6)
behaves
because each number is also considered a vector of length 1.
Given this, you need to think about c()
as a function that
combines multiple vectors into a single vector.
First Statistical Simulation!
We’ve learned:
- How to create vectors that represents data, e.g.
sample()
- How to write a for-loop
- How to apply functions on vectors, e.g.
mean()
Now we can run simulations for the Law of Large Numbers!
The Law of Large Numbers says that sample averages with larger sample sizes will be “closer” to the population average. Instead of the standard error, however, we will use the mean absolute error for simplicity.
To demonstrate this, we will use the following template. I encourage you to
- Read the comments, code, variable names to get an overall understanding of the code
- Update
sample_size
to 100 see the impact on the final result, this should align with your understanding of the law of large numbers. - THEN go back and make sure you understand each line
- What operation is being done:
- assigning a new variable?
- subsetting?
- applying a function to a vector?
- is the output a vector of length 1 (single number) or more?
- What operation is being done:
# Creating the unknown population mean
fair_coin <- c(0, 1)
biased_coin <- sample(fair_coin, 7, replace=TRUE)
population_avg <- mean(biased_coin)
# We will change the sample size to see the impact on the final result
sample_size <- 10
# The loop contains the repeated process of calculating different samples,
# calculating their corresponding sample averages, then calculating the
# the absolute error from the true population average.
abs_errors <- c()
num_simulation <- 1000
for(i in 1:num_simulation){
sim_sample <- sample(biased_coin, sample_size, replace=TRUE)
sample_avg <- mean(sim_sample)
abs_error <- abs(sample_avg - population_avg)
abs_errors[i] <- abs_error
}
# Finall, to evaluate "closeness" we will simply take the average
# across all absolute errors, a smaller value here would indicate
# "closer"
mean(abs_errors)
Comment on the code: the code above is intentionally written in a very verbose fashion.
Best practices on writing for-loops for beginners
In general, when you are about to write a for-loop, the last thing
you want to write is the for(){}
statement.
Given the law of large number example, the first things that should be written is the final result, the mean absolute error.
mean_abs_error <- mean(abs_errors)
This line won’t run because abs_errors
is not defined yet. abs_errors
,
however, is a collection of different values of abs_error
, otherwise
the average operation wouldn’t make sense.
abs_errors <- c()
i <- 1
abs_errors[i] <- abs_error
mean_abs_error <- mean(abs_errors)
Following the same logic, abs_error
has not been defined yet.
abs_errors <- c()
i <- 1
abs_error <- abs(sample_avg - population_avg)
abs_errors[i] <- abs_error
mean_abs_error <- mean(abs_errors)
Now sample_avg
and population_avg
are not defined.
But both of these require us to define a population. This would
normally be provided to you or defined by the problem context:
fair_coin <- c(0, 1)
biased_coin <- sample(fair_coin, 7, replace=TRUE)
population_avg <- mean(biased_coin)
sample_size <- 10
abs_errors <- c()
i <- 1
sample_data <- sample(biased_coin, sample_size, replace=TRUE)
sample_avg <- mean(sample_data)
abs_error <- abs(sample_avg - population_avg)
abs_errors[i] <- abs_error
mean_abs_error <- mean(abs_errors)
Now we need to repeat the calculation with a for-loop.
This requires us to replace the i=1
since that will
be defined by the for-loop instead.
fair_coin <- c(0, 1)
biased_coin <- sample(fair_coin, 7, replace=TRUE)
population_avg <- mean(biased_coin)
sample_size <- 10
abs_errors <- c()
for(i in 1:1000){
sample_data <- sample(biased_coin, sample_size, replace=TRUE)
sample_avg <- mean(sample_data)
abs_error <- abs(sample_avg - population_avg)
abs_errors[i] <- abs_error
}
mean_abs_error <- mean(abs_errors)
By fixing the naming afterwards, the code is finished!
Review
In this section, we have learned about
- Vectors of numeric values
- How to use built-in functions on vectors
- This includes functions that generate data
- This includes functions that summarizes data
- for-loops