Applied Statistical Methods - Homework 0
Goals
- Refresh and assess your basics from the required courses.
- Encourage you to set up your Jupyter notebook properly.
Format
Please return a PDF file with your solutions on GradeScope.
Questions
Please download the data from CourseWorks: Files/Data/unemployment_cpi_unempl_2000_2020.json
This is a dataset from the Bureau of Labor Statistics regarding inflation and unemployment.
In economics, we believe that low unemployment will lead to higher inflation. The reasoning is that labor is hard to find, so employers need to attract workers with higher salaries. With higher salaries, people are willing to pay more for goods, which leads to inflation. We will roughly validate this theory at the US national level at a very coarse time scale. You are not expected to have taken an economics class to do this assignment.
The time series data:
- “LNS14000000” is the seasonally adjusted national unemployment rate
- “CWUS0000SA0” is the consumer price index based on urban wage earners
- To obtain inflation from this, we simply calculate the percentage change based on these values.
- These values are indexed so that prices in the years 1982-84 = 100.
- We will assume “1st Half” = Jan-June and “2nd Half” = July-Dec.
Q0 - Standardizing the dataset
Please read in the JSON data and export a CSV file with the following columns:
- year
- period: this should take on the value “1st Half” or “2nd Half”
- unemployment: this should be an average percentage over the appropriate months
- cpi: this shouldn’t need much processing
Hint: if you haven’t seen a JSON file before, here’s some sample code for R:
library(jsonlite)
# Parse the JSON file into nested R lists
data <- read_json("MYFILE.json")
# Inspect the top-level type, then the first element
class(data)
print(data[[1]])
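Once the JSON is parsed, the half-year averaging for the unemployment column could be sketched roughly as below. This is only an illustration under stated assumptions: the `monthly` data frame, its column names, its rates, and the output path are all hypothetical and are not taken from the BLS file, whose structure you will need to inspect yourself.

```r
# Hypothetical monthly data for one year; the rates are made up for illustration.
monthly <- data.frame(
  year  = rep(2000, 12),
  month = 1:12,
  unemployment = c(4.0, 4.2, 4.4, 4.0, 4.2, 4.4,
                   5.0, 5.2, 5.4, 5.0, 5.2, 5.4)
)

# Map months to halves: Jan-June is "1st Half", July-Dec is "2nd Half".
monthly$period <- ifelse(monthly$month <= 6, "1st Half", "2nd Half")

# Average the monthly rate within each (year, period) group.
half <- aggregate(unemployment ~ year + period, data = monthly, FUN = mean)

# Export to CSV; a temporary file is used here as a stand-in path.
out_file <- file.path(tempdir(), "standardized.csv")
write.csv(half, out_file, row.names = FALSE)
print(half)
```

The cpi column would still need to be joined on year and period, which depends on how the series appear in the parsed list.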
Q1 - Calculate the inflation
Using your results from Q0, please calculate the inflation rate and add a column to your data named inflation. For example, the inflation rate for 2000, 2nd Half should be (170.2 - 167.6)/167.6 * 100 = 1.55%. If an inflation rate cannot be calculated, please replace the value with NA.
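As a minimal sketch of the percentage-change calculation, the snippet below reproduces the 2000, 2nd Half example. The data frame `df` is a hypothetical stand-in that holds only the two CPI values quoted in that example, and it assumes the rows are already sorted chronologically.

```r
# Toy data frame; only the two CPI values from the Q1 example are used.
df <- data.frame(
  year   = c(2000, 2000),
  period = c("1st Half", "2nd Half"),
  cpi    = c(167.6, 170.2)
)

# Percentage change from the previous half-year. The first row has no
# predecessor, so its inflation is set to NA.
df$inflation <- c(NA, diff(df$cpi) / head(df$cpi, -1) * 100)

print(round(df$inflation, 2))
```

With the full dataset, the same line applies as long as the rows are ordered by year and then by period.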
Q1.1 - R’s handling of NA values
- If we have a vector of inflation values called inflation, where one is NA, how does the R command mean(inflation) handle the NA?
- How does R’s lm() function handle NA values if there are NA values in the independent (the X) or dependent variables (the Y)?
Q2 - Visualizing the data
Please draw a scatter plot of the inflation rate against the unemployment rate, with the axes labeled with units. Inflation should be on the y-axis.
Q3 - Fit a regression and analyze the output
Fit an OLS regression to the data plotted in Q2 and report the fitted slope and its p-value. Why is the slope relevant to our problem here (at most 3 sentences)?
Q4 - Assumptions
Which assumptions were required for your p-value in Q3 to make sense?
Q5 - Test whether the inflation is independent of the unemployment rate via permutation test
Please write the code for the following:
- Calculate the correlation between the 2 variables and record this value.
- Repeat the following 1000 times: shuffle the order of one of the variables, then recalculate the correlation. In other words, we’re making the two variables independent from one another.
- Please plot the histogram of these recalculated correlations against the original correlation.
- Please comment on what you can infer from the histogram.
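The steps above could be sketched as follows. This is an illustration on simulated stand-in data, not the BLS series: the vectors `x` and `y`, the sample size, and the seed are all assumptions made so the snippet runs on its own.

```r
set.seed(1)  # fix the random shuffles so the run is reproducible

# Simulated stand-in data (NOT the BLS series): y depends on x plus noise.
n <- 40
x <- rnorm(n)
y <- -0.5 * x + rnorm(n)

# Step 1: correlation observed in the original pairing.
obs_cor <- cor(x, y)

# Step 2: shuffle one variable 1000 times and recompute the correlation.
# Shuffling breaks the pairing, which mimics independence.
perm_cor <- replicate(1000, cor(x, sample(y)))

# Step 3: histogram of the shuffled correlations, with the observed
# correlation marked as a vertical line.
hist(perm_cor, xlim = range(c(perm_cor, obs_cor)),
     main = "Permutation distribution of the correlation",
     xlab = "Correlation")
abline(v = obs_cor, col = "red", lwd = 2)

# Share of shuffles at least as extreme as the observed correlation.
p_perm <- mean(abs(perm_cor) >= abs(obs_cor))
print(p_perm)
```

With the real data, rows where inflation is NA would need to be dropped first (or handled via the `use` argument of `cor()`), since `cor()` returns NA by default when any input is missing.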
Q6 - Simulation
This is not related to the questions above. Please create a simulation in R that demonstrates that unbiasedness of a linear model's parameter estimate can fail if one of the linear regression assumptions is violated.
Your solution should clearly state:
- What is the parameter you are estimating?
- Which assumption are you violating?
- What is unbiasedness, and how is it not observed in your simulation (this should be a visualization)?