Wayne's Github Page

A place to learn about statistics

Homework4 - Preparing for your final project

Goals

The purpose of this homework is to give you some time to work on your final project

Question 0: Final project check in

Please create one graph that visualizes the data for your final project. If you’re still collecting data, please grab external sources of data that would give you a sense of what the data may look like. For example, if I’m waiting for the sleeping data of my baby (who is not born yet), I might get data (or a plot) from medical journals on what is the expected sleeping duration for a new born child.

Question 1: Comparing kriging to weighted averages

Let’s try to simulate data that has a spatial dependence. Our goal is to create data \(\epsilon = \begin{bmatrix}\epsilon(s_1) \\ \vdots \\ \epsilon(s_n) \end{bmatrix}\) where \(Cov(\epsilon(s_i), \epsilon(s_j)) = \sigma^2 \exp(-\frac{d(s_i, s_j)}{\rho}) > 0\). \(s\) indicate the different locations, \(d(s_i, s_j)\) is the distance between \(s_i\) and \(s_j\). So the covariance decreases as distance increases between 2 locations (e.g. NYC’s weather vs Hoboken weather relative to NYC vs SF weather).

# Part 1
set.seed(100)
library(RColorBrewer)
sigma2 <- pi
rho = 10

n_samps <- 100
even_pts <- seq(0, 1, length.out=n_samps)
neighbor_dist <- 1 / n_samps
locations <- expand.grid(even_pts, even_pts)
dist_mat <- as.matrix(dist(locations, diag=TRUE, upper=TRUE))
cov_mat <- sigma2 * exp(-dist_mat/rho)
noise <- rnorm(nrow(locations))
L <- chol(cov_mat)
corr_noise <- L %*% noise
mean_fun <- function(locs){locs[, 1] + locs[, 2] * 3}
y <- mean_fun(locations) + corr_noise 

# VISUALIZATION CODE GOES HERE

# Part 2-4
inner_square <- (locations[, 1] > 0.1
                 & locations[, 2] > 0.1
                 & locations[, 1] < 0.9
                 & locations[, 2] < 0.9)
test_locs <- sample(which(inner_square), 100)
get_biased_neighbors <- function(loc, locs, n=10){
    x <- loc[[1]]
    y <- loc[[2]]
    neighbors <- abs(locs[, 1] - x) < 0.1 & abs(locs[, 2] - y) < 0.1
    south_west_neighbors <- (locs[, 1] < x
                             & locs[, 2] < y
                             & neighbors)
    north_east_neighbors <- (locs[, 1] > x
                             & locs[, 2] > y
                             & neighbors)
    train <- sample(which(south_west_neighbors), n - 1)
    train <- c(train, sample(which(north_east_neighbors), 1))
    return(train)
    }
train_locs <- lapply(test_locs, function(ind) get_biased_neighbors(locations[ind, ], locations))

Question 2: Comparing sampling methods

Please generate data using the following code:

set.seed(5833)
n <- 2^20
# create 5 states with one clear minority
states <- rep(c('A', 'B', 'C', 'D'), c(n/2, n/2^2, n/2^8, n/2^8))
states <- c(states, rep('E', n - length(states)))
# table(states) / n
streets <- vector('integer', n)
start <- 1
end <- 0
for(state in unique(states)){
    state_total <- sum(states == state)
    end <- end + state_total
    # let there be around 500 people on every street within a state
    streets[start:end] <- sample(seq_len(round(state_total / 500)), state_total, replace=TRUE)
    start <- end + 1
}
response <- rnorm(n, ifelse(streets < 5, -20, 0), ifelse(states %in% c('B', 'C'), 20, 5))
special <- 'D'
response[states == special] <- response[states == special] + rnorm(sum(states == special), 100, sd=50)

This is simulating a scenario where we could leverage people’s state and their street to survey them for a variable, called response. It’s important to know that “streets” are nested within the state, i.e. street 1 in State A is not the same as Street 1 in State B.

For the following sampling methods:

Please calculate or simulate the following for each of the sampling methods above:

Question 3: Simulation

Imagine a variable, Y, where $Y_{men}\sim N(0, Variance=1)$ and $Y_{women} \sim N(1, Variance=4)$. If we can collect 100 samples from a population of 1000000 people (you should assume exactly 50/50 split between the sex). Which of the following is the most accurate on average for estimating the population average or are they comparable? Please justify your answer with a proof or a simulation.