Wayne's Github Page

A place to learn about statistics

HW2 - Processing Data before Applying Algorithms

Q0 - Simulating dependency

Please show, via math or simulation, the following. Suppose \(Var(X) = Var(Z) = \sigma^2\), \(X\) is independent of \(Z\), and we define:

\[Y = \alpha X + \sqrt{1 - \alpha^2} Z\]

Then

\[Cor(X, Y) = \alpha\]

For those using a simulation, a single “setting” (one choice of \(\alpha\) and \(\sigma\)) is sufficient.
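A minimal simulation sketch of the construction above, assuming the quantity to check is the correlation between \(X\) and \(Y\) (which the definition of \(Y\) suggests should equal \(\alpha\)). The setting used here (alpha = 0.7, sigma = 3, one million normal draws) and the seed are arbitrary choices for illustration, not part of the assignment.

```r
set.seed(1)          # arbitrary seed, only for reproducibility

n_sim <- 1e6         # number of simulated draws (assumed)
alpha <- 0.7         # one "setting" for alpha (assumed)
sigma <- 3           # common standard deviation of X and Z (assumed)

# X and Z are drawn independently with the same variance sigma^2
X <- rnorm(n_sim, mean = 0, sd = sigma)
Z <- rnorm(n_sim, mean = 0, sd = sigma)

# Y as defined in the question
Y <- alpha * X + sqrt(1 - alpha^2) * Z

var_Y <- var(Y)      # compare against sigma^2
cor_XY <- cor(X, Y)  # compare against alpha

cat("Var(Y):", var_Y, " sigma^2:", sigma^2, "\n")
cat("Cor(X, Y):", cor_XY, " alpha:", alpha, "\n")
```

On paper, the same check rests on \(Var(Y) = \alpha^2\sigma^2 + (1 - \alpha^2)\sigma^2 = \sigma^2\) and \(Cov(X, Y) = \alpha\sigma^2\), both of which use the independence of \(X\) and \(Z\).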

Q1 - The Problem of Collinearity

Below we’ll simulate some regression data, where Y depends on H, a numeric matrix with 3 columns.

n <- 500

beta <- c(0, -1.5, 1)
p <- length(beta)
# H stands for hidden
H <- matrix(rnorm(n * p, mean=5, sd=3), ncol=p)
# Intercept is 0 as well
Y <- H %*% beta + rnorm(n, sd=4)

Use the result from Q0 to create a feature matrix X that is \(n \times 100\), where 10 columns depend on \(H_{.,1}\), i.e. the first column of H, 20 columns depend on \(H_{.,2}\), and 70 columns depend on \(H_{.,3}\). Please construct this dependency so that the correlation between each \(H_{.,i}\) and each column of X that depends on it is between 0.5 and 0.9.
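One way to apply the Q0 construction here, sketched under assumptions: each column of X mixes one column of H with independent noise that has the same standard deviation as H (3), using a weight alpha drawn uniformly between 0.55 and 0.85. The names `make_block`, `alphas`, and `source_col`, the seed, and the alpha range are hypothetical choices, and the sample correlations will scatter around the chosen alpha values. The block regenerates H so that it runs on its own.

```r
set.seed(2)                      # arbitrary seed (assumed)

n <- 500
p <- 3
H <- matrix(rnorm(n * p, mean = 5, sd = 3), ncol = p)

# Hypothetical helper: build k columns that each depend on the vector h.
# Each column is alpha * h + sqrt(1 - alpha^2) * z, with z independent
# of h and drawn with the same standard deviation as h.
make_block <- function(h, k, sd_h = 3, lo = 0.55, hi = 0.85) {
  alphas <- runif(k, min = lo, max = hi)
  Z <- matrix(rnorm(length(h) * k, mean = 0, sd = sd_h), ncol = k)
  # outer() puts alphas[j] * h in column j;
  # sweep() scales column j of Z by sqrt(1 - alphas[j]^2)
  block <- outer(h, alphas) + sweep(Z, 2, sqrt(1 - alphas^2), `*`)
  list(X = block, alphas = alphas)
}

counts <- c(10, 20, 70)          # columns tied to H[, 1], H[, 2], H[, 3]
blocks <- lapply(seq_len(p), function(i) make_block(H[, i], counts[i]))

X <- do.call(cbind, lapply(blocks, `[[`, "X"))
alphas <- unlist(lapply(blocks, `[[`, "alphas"))
source_col <- rep(seq_len(p), times = counts)  # H column behind each X column

# Sample correlation between each X column and the H column it depends on
cors <- sapply(seq_len(ncol(X)), function(j) cor(X[, j], H[, source_col[j]]))
print(dim(X))
print(range(cors))
```

Drawing alpha slightly inside the target interval leaves some room for sampling noise, since with n = 500 a sample correlation can differ from its population value by a few hundredths.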

Things to think about: do different normalizing procedures matter in this simulation?

Q2 - Weather

The data is from https://www.ncei.noaa.gov/products/land-based-station/us-historical-climatology-network

Use the data in CourseWorks/Files/Weather/*

Please do the following:

Thought experiment (no work required for this): if we transposed X, i.e. treated space as the measurements and year/months as the features, what type of relationship would we be trying to understand?

Q3 - Citations

In CourseWorks/Files/Citations/*, you’ll see some citation data.

Please do the following:

Q4 - Explore voting patterns of senators using PCA

We’ve collected data from senate.gov, saved in CourseWorks/Files/Congress/*.

In votes.json, you will find the Senate voting records from the 105th to 115th Senate. The file is organized by senators, each with an id encoded by S followed by a digit. For each senator, the vote on the last version of each bill is recorded: 1 stands for “Yea”, -1 stands for “Nay”, 0 stands for “Not voting”, and -9999 is used if there were issues during data collection. The bills are labeled with the congress term (e.g. 106), the session number (e.g. 1 or 2), then the issue ID.
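The coding scheme above means the raw values should not go straight into PCA, because -9999 would dominate the variance. A minimal sketch on a made-up senator-by-bill matrix (the senator ids, bill labels, and votes below are invented for illustration and are not taken from votes.json) treats -9999 as missing and then, as one assumed choice among several, replaces missing entries with 0 before calling `prcomp`.

```r
# Toy senator-by-bill matrix; all ids, labels, and votes are made up
votes <- matrix(
  c( 1,  1, -1,     1, -1,
     1,  1, -1,     0, -1,
    -1, -1,  1,    -1,  1,
    -1, -1,  1, -9999,  1,
     1, -1,  1,     1, -1,
    -1,  1, -1,    -1,  0),
  nrow = 6, byrow = TRUE,
  dimnames = list(paste0("S", 1:6),
                  paste0("106_1_", 1:5))   # hypothetical bill labels
)

# -9999 marks a data collection issue, so treat it as missing
votes[votes == -9999] <- NA

# Assumed imputation: count a missing vote as 0, the same as "Not voting"
votes_clean <- votes
votes_clean[is.na(votes_clean)] <- 0

# PCA with senators as observations and bills as features
pca <- prcomp(votes_clean, center = TRUE, scale. = FALSE)

# Proportion of variance explained by each principal component
var_explained <- pca$sdev^2 / sum(pca$sdev^2)
print(round(var_explained, 3))

# Senator coordinates on the first two components
print(round(pca$x[, 1:2], 3))
```

Other treatments of missing votes, such as dropping bills or senators with many missing entries, are equally plausible; the choice can change the components and is worth stating in the write-up.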

Use PCA to explore the voting pattern. Please at least write down: