Wayne's Github Page

A place to learn about statistics

Homework 4: Prediction vs Inference

Here are the formulas in textbooks and from our slides when all 5 assumptions are satisfied under simple linear regression:
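For reference, the standard textbook expressions for these quantities are sketched below. The notation (\(S_{xx}\), \(\hat{e}_i\), \(T_{n-2}\)) is an assumption and may differ from the slides.

\[
S_{xx} = \sum_{i=1}^{n} (x_i - \bar{x})^2, \qquad
\hat{\beta}_1 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{S_{xx}}, \qquad
\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}
\]

\[
SE(\hat{\beta}_1 \vert X) = \sqrt{\frac{\sigma^2}{S_{xx}}}, \qquad
SE(\hat{\beta}_0 \vert X) = \sqrt{\sigma^2 \left( \frac{1}{n} + \frac{\bar{x}^2}{S_{xx}} \right)}
\]

\[
\hat{\sigma}^2 = \frac{1}{n-2} \sum_{i=1}^{n} \hat{e}_i^2, \qquad
\hat{e}_i = y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i
\]

\[
t_j = \frac{\hat{\beta}_j}{\hat{SE}(\hat{\beta}_j \vert X)}, \qquad
p_j = 2 \, P\left( T_{n-2} > \vert t_j \vert \right), \qquad j = 0, 1
\]

Here \(\hat{SE}\) is obtained by replacing \(\sigma^2\) with \(\hat{\sigma}^2\) in the corresponding \(SE\) formula, and \(T_{n-2}\) denotes a t-distributed random variable with \(n-2\) degrees of freedom.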

Note that:

\(\hat{SE}(\hat{\beta}_1 \vert X)\) differs from \(SE(\hat{\beta}_1 \vert X)\) because it estimates \(\sigma^2\) with \(\hat{\sigma}^2\).

Q1 - Verifying calculations from summary.lm()

Please create a matrix or data frame called summ_tab, using the formulas above, that matches the coefficients component of the output from summary.lm(). You may use basic functions like pt(), mean(), sd(), cor(), sum(), /, etc.

Please generate your own random dataset to demonstrate this.

Hint: df stands for degrees of freedom.
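As a rough sketch of what the target looks like, the snippet below fits a model on simulated data and pulls out the table that summ_tab should match. The seed, sample size, and true coefficients are arbitrary assumptions chosen only for illustration. It also checks the residual degrees of freedom and \(\hat{\sigma}\) by hand.

```r
# Simulated data; the seed, sample size, and coefficients are arbitrary choices.
set.seed(1)
n <- 30
x <- rnorm(n)
y <- 2 + 0.5 * x + rnorm(n)

mod <- lm(y ~ x)

# The target: a 2 x 4 matrix with columns
# "Estimate", "Std. Error", "t value", "Pr(>|t|)".
target <- summary(mod)$coefficients
print(target)

# Residual degrees of freedom: n minus the two estimated coefficients.
df_resid <- n - 2
print(df_resid == mod$df.residual)

# sigma-hat computed by hand agrees with what summary.lm() reports.
sigma_hat <- sqrt(sum(resid(mod)^2) / df_resid)
print(all.equal(sigma_hat, summary(mod)$sigma))
```

Comparing your own table against target with all.equal() is one reasonable way to demonstrate the match.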

Q2 - different ways to generate data

Please first generate data using the following code:

n <- 50
x <- rnorm(n)
z <- runif(n)
w <- rexp(n)
indep_vars <- cbind(x, z, w)

errors <- rnorm(n, sd=10)
betas <- c(1, 2, 3, 4)
y <- betas[1] + betas[2] * x + betas[3] * z + betas[4] * w + errors
# The following will produce an error
should_be_y <- indep_vars %*% betas + errors

The %*% symbol is how we tell R to perform matrix multiplication. The last line above should produce an error. Please modify indep_vars such that y and should_be_y are almost identical (within machine precision). Be sure to demonstrate that the two variables, after your fix, are almost identical.

FYI, R implicitly treats a vector as an n by 1 matrix (or a 1 by n matrix, whichever makes the dimensions conformable) when doing matrix operations.
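The behavior of %*% with plain vectors can be sketched on a small example. The matrix A and vector v below are made up for illustration and are unrelated to the homework data. The sketch also shows one way to check that two results agree within machine precision.

```r
A <- matrix(1:6, nrow = 2)   # 2 x 3 matrix
v <- c(1, 0, -1)             # plain vector of length 3

# v is treated as a 3 x 1 column, so the result is a 2 x 1 matrix.
Av <- A %*% v
print(dim(Av))

# A length-2 vector cannot be made conformable with the 3 columns of A.
# tryCatch() captures the error message instead of stopping the script.
msg <- tryCatch(A %*% c(1, 2), error = function(e) conditionMessage(e))
print(msg)

# Compare the matrix product with the same sum written out term by term.
by_hand <- A[, 1] * v[1] + A[, 2] * v[2] + A[, 3] * v[3]
print(all.equal(as.vector(Av), by_hand))
print(max(abs(Av - by_hand)))
```

all.equal() compares with a small numerical tolerance, while the maximum absolute difference shows how close the two results actually are.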

Q3 - Paper Trustworthiness (long)

You’ll need the CSV file on Ed Resources titled “trustworthiness.csv”. The data were shared by the authors of “Trustworthiness of Crowds is gleaned in half a second” and processed by your instructor.

You do not need to read the whole paper to do this problem, but the lead author will speak in our class on 11/8/2023.

We will focus on the 3rd study, where people role-play as investors for individual business associates or groups of business associates. They could invest any amount between $0.00 and $1.00 in $0.25 increments. The investment will triple, but the amount returned to the investor will depend on the business associates. The business associates’ faces were shown to the investors for only 500 ms. There were 50 photos in each category: an individual face (a single face was shown), an ensemble of faces (an ensemble which includes the single face), and an individual’s face surrounded by other members (highlighted).

Q4 - confidence intervals for the line from different bootstraps

Using the file hw4_scatter.csv, please reproduce the following plot:

[Figure: conf_inter_boot]
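As a minimal sketch of one bootstrap flavor, the code below resamples (x, y) pairs and draws a pointwise percentile band for the fitted line. It runs on simulated stand-in data because the columns of hw4_scatter.csv are not described here. The number of replicates, the grid, and the 95% level are all assumptions. Other schemes, such as resampling residuals, follow the same loop with a different resampling step.

```r
set.seed(2)
n <- 40
x <- runif(n, 0, 10)
y <- 1 + 0.8 * x + rnorm(n, sd = 2)   # simulated stand-in data

x_grid <- seq(min(x), max(x), length.out = 25)
n_boot <- 500

# Each column holds one bootstrap replicate of the fitted line on x_grid.
boot_lines <- replicate(n_boot, {
  idx <- sample(n, n, replace = TRUE)   # resample rows with replacement
  fit <- lm(y ~ x, data = data.frame(x = x[idx], y = y[idx]))
  predict(fit, newdata = data.frame(x = x_grid))
})

# Pointwise 95% percentile band across replicates (row 1 lower, row 2 upper).
band <- apply(boot_lines, 1, quantile, probs = c(0.025, 0.975))

plot(x, y)
abline(lm(y ~ x))
lines(x_grid, band[1, ], lty = 2)
lines(x_grid, band[2, ], lty = 2)
```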

Hint:

Q5 - intuition practice

Q6 - Prediction Intervals

First generate your data following the process below:

set.seed(100)
n <- 50
x_range <- 1:20
beta0 <- -5
beta1 <- 1.2
x <- sample(x_range, n, replace=TRUE)
y <- beta0 + beta1 * x + rnorm(n)

Do not overwrite your original x, y data pairs.
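One way to get both kinds of interval without touching the original data is sketched below, using predict() on the data-generating code above. The new x values (5, 10, 15) and the 95% level are arbitrary assumptions for illustration.

```r
set.seed(100)
n <- 50
x_range <- 1:20
beta0 <- -5
beta1 <- 1.2
x <- sample(x_range, n, replace = TRUE)
y <- beta0 + beta1 * x + rnorm(n)

fit <- lm(y ~ x)

# New x values live in their own object, so the original x and y stay intact.
new_x <- data.frame(x = c(5, 10, 15))

# Interval for the regression line (mean response) at each new x.
conf_band <- predict(fit, newdata = new_x, interval = "confidence", level = 0.95)

# Interval for a single new observation at each new x.
pred_band <- predict(fit, newdata = new_x, interval = "prediction", level = 0.95)

print(conf_band)
print(pred_band)
```

Both results have columns fit, lwr, and upr. The prediction interval is wider at every x because it also accounts for the noise in a new observation.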

Q7 - Prediction or inference

For each of the following, please state whether we should care about the prediction interval or the confidence interval for the regression line. There is no need to explain your choice.

Q8 - How categorical values are turned into numbers and handled in regression

Run the following code:

n <- 50
cat_variable <- c('A', 'B', 'C')
group <- factor(sample(cat_variable, n, replace=TRUE), levels=cat_variable)
y <- rnorm(n)

X <- model.matrix(~group)
mod <- lm(y ~ group)
mod2 <- lm(y ~ X)
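A short sketch of how one might inspect these objects follows. It repeats the code above with a seed added (an assumption, for reproducibility) and then prints the design matrix and both sets of coefficients.

```r
set.seed(3)   # seed added here only so the sketch is reproducible
n <- 50
cat_variable <- c('A', 'B', 'C')
group <- factor(sample(cat_variable, n, replace = TRUE), levels = cat_variable)
y <- rnorm(n)

X <- model.matrix(~group)
mod <- lm(y ~ group)
mod2 <- lm(y ~ X)

# X has an intercept column plus 0/1 indicator columns for B and C;
# level A is the baseline, absorbed into the intercept.
print(head(X))
print(colnames(X))

# lm(y ~ X) adds its own intercept, so the "(Intercept)" column inside X
# is redundant and its coefficient is reported as NA.
print(coef(mod))
print(coef(mod2))

# The remaining coefficients of mod2 match those of mod.
b2 <- coef(mod2)
print(all.equal(unname(coef(mod)), unname(b2[!is.na(b2)])))
```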