Wayne's Github Page

A place to learn about statistics

Homework 5: More challenges with Multivariate Regression

Q0 - Collinearity and Cross validation

On Ed Resources, there’s a file named processed_nyc_payroll_2022.csv that originally came from NYC Open Data.

We will explore the behavior of regression with collinearity through this example. Our main variable of interest is the total pay for each individual and we will learn its relationship with all the other features.
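As a warm-up for what collinearity does to a regression, here is a minimal sketch on simulated data (not the payroll file). The variables x1, x2, x3 and y are made up for illustration: x2 is a near-copy of x1 and x3 is an exact multiple of x1.

```r
# Simulated data (NOT the payroll file); all variable names are hypothetical.
set.seed(1)
n  <- 200
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.01)   # almost perfectly collinear with x1
x3 <- 2 * x1                     # exactly collinear with x1
y  <- 1 + 3 * x1 + rnorm(n)

fit_alone <- lm(y ~ x1)
fit_near  <- lm(y ~ x1 + x2)
fit_exact <- lm(y ~ x1 + x3)

# Standard error of the x1 coefficient with and without the near-duplicate.
se_alone <- summary(fit_alone)$coefficients["x1", "Std. Error"]
se_near  <- summary(fit_near)$coefficients["x1", "Std. Error"]
print(c(alone = se_alone, with_near_duplicate = se_near))

# With an exact linear dependence, lm() cannot estimate x3 and reports NA.
print(coef(fit_exact))
```

The near-duplicate feature inflates the standard error of the x1 coefficient, and the exact duplicate leaves one coefficient unestimated. Expect similar symptoms if the payroll features overlap heavily.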

Q1 - explore the data

Q2 - fitting an OLS with collinear features

Continuing from Q1

Side comment: the model is bad, so don’t trust the results. A lot more work would need to go into understanding these posted payroll values.

Q3 - Cross validation

Hopefully you’re questioning why we bothered taking the square root of the total pay. We will use cross validation to make this argument.

We have 2 competing models:

Side comment: I highly encourage you to try to make an intuitive argument for BOTH models before doing the analysis.

Main question: please do a 10-fold cross validation (notice that a jackknife, i.e. leaving out one observation at a time, should feel a bit intimidating for most laptops) to see which method of prediction has the lower mean absolute prediction error.

Some specifications:

Side comment: Why do you think this is?
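The 10-fold procedure asked for in Q3 can be sketched roughly as follows. This is only an illustration on simulated data, and it assumes the two competing models are (A) a regression on total pay directly and (B) a regression on the square root of total pay whose predictions are squared back to the pay scale. The column names base_salary and total_pay are hypothetical, and the homework's own specifications take precedence over anything here.

```r
# Simulated stand-in for the payroll data; column names are hypothetical.
set.seed(2022)
n   <- 500
dat <- data.frame(base_salary = runif(n, 30, 150))
dat$total_pay <- (sqrt(dat$base_salary) + rnorm(n, sd = 0.5))^2

k    <- 10
fold <- sample(rep(seq_len(k), length.out = n))  # random fold label per row

abs_err_raw  <- numeric(n)
abs_err_sqrt <- numeric(n)

for (j in seq_len(k)) {
  train <- dat[fold != j, ]
  test  <- dat[fold == j, ]

  # Assumed model A: regress total pay directly.
  fit_raw  <- lm(total_pay ~ base_salary, data = train)
  # Assumed model B: regress sqrt(total pay), then square the prediction.
  fit_sqrt <- lm(sqrt(total_pay) ~ base_salary, data = train)

  pred_raw  <- predict(fit_raw,  newdata = test)
  pred_sqrt <- predict(fit_sqrt, newdata = test)^2

  # Both errors are measured on the original pay scale so they are comparable.
  abs_err_raw[fold == j]  <- abs(test$total_pay - pred_raw)
  abs_err_sqrt[fold == j] <- abs(test$total_pay - pred_sqrt)
}

mae <- c(raw = mean(abs_err_raw), sqrt = mean(abs_err_sqrt))
print(mae)
```

The key design choice is that every row is predicted exactly once by a model that never saw it, and that both models are scored in the same units. Comparing errors on the square-root scale against errors on the pay scale would not be a fair contest.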

Q4 - Global preferences

On Ed Resources, you’ll find a file called global_pref_survey_individual.csv. This is the individual survey data referenced in our reading, Global Evidence on Economic Preferences. Before you start, make sure your R version is at least 4.0 (you can check via R.version). If not, it’s recommended that you update R. The biggest difference is whether character columns are read in as factors (the default before 4.0) or as characters (the default from 4.0 onward).
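A quick way to check the version and to see the factor versus character difference is sketched below. The tiny CSV is made up for illustration and is not the survey file. Setting stringsAsFactors explicitly makes the behavior the same on any R version.

```r
# Major version as an integer, e.g. 4 for any R 4.x release.
major <- as.integer(R.version$major)
print(major)

# A tiny in-memory CSV (made up; not the survey file).
csv_text <- "country,patience\nUSA,0.8\nPeru,-0.1"

as_chr <- read.csv(text = csv_text, stringsAsFactors = FALSE)
as_fct <- read.csv(text = csv_text, stringsAsFactors = TRUE)

print(class(as_chr$country))  # "character"
print(class(as_fct$country))  # "factor"
```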

Please answer the following using this dataset: