Wayne's Github Page

A place to learn about statistics

HW2 - Data Mining using Simple Statistics

Context - US’s winner-takes-all voting system

Most elections in the US are “winner-takes-all”, which encourages the formation of a two-party system. We will see whether we can identify senators’ party affiliations simply by looking at their voting patterns.

We’ve collected the data from senate.gov.

Q1 Data wrangling of the Senate voting records

In votes.json, you will find the Senate voting records from the 105th to the 115th Senate. The file is organized by senator, each with an id encoded by S followed by a digit. For each senator, the vote on the last version of each bill is recorded: 1 stands for “Yea”, -1 stands for “Nay”, 0 stands for “Not voting”, and -9999 is used if there were issues during data collection. The bills are labeled with the congress term (e.g. 106), the session number (e.g. 1 or 2), and then the issue ID.

Please wrangle this dataset into a matrix called voting_matrix so we can calculate the pairwise correlations between the senators’ voting behavior, i.e. we will run cor(voting_matrix, ...) in Q3 to identify how senators vote with one another.
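
As a rough illustration of the target shape, here is a minimal sketch on made-up data. It assumes the parsed JSON ends up as a named list with one named vote vector per senator. The bill label format ("106_1_1") and the object name `votes` are hypothetical stand-ins rather than the actual contents of votes.json. Senators go in the columns because cor() correlates the columns of a matrix.

```r
# Hypothetical stand-in for the parsed votes.json: one named vote vector per
# senator, keyed by a made-up bill label (congress_session_issue).
votes <- list(
  S1 = c("106_1_1" = 1, "106_1_2" = -1, "106_2_1" = 0),
  S2 = c("106_1_1" = 1, "106_1_2" = -9999),
  S3 = c("106_1_2" = -1, "106_2_1" = 1)
)

# Every bill that appears for at least one senator, in a stable order.
bills <- sort(unique(unlist(lapply(votes, names))))

# One column per senator; indexing by bill name leaves NA where a senator
# has no record for that bill.
voting_matrix <- sapply(votes, function(v) unname(v[bills]))
rownames(voting_matrix) <- bills

dim(voting_matrix)
```

Whether the real file needs extra flattening before this step depends on how it is nested, so treat the list construction above as a placeholder for your own parsing code.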

Please report the dimensions of your matrix and the percentage of its entries that are not -1, 1, or 0. For example, with only 2 senators and 2 bills, if one of the four entries is not -1, 1, or 0, then I would report 25%.
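
The 2-senator, 2-bill example can be checked with a short sketch like the one below. The matrix `toy` and its row labels are invented for illustration. Counting missing (NA) entries as "not -1, 1, or 0" is an assumption about how the question should be read.

```r
# The 2-senator, 2-bill example: one of the four entries is invalid.
toy <- matrix(c(1, -1,
                0, -9999),
              nrow = 2, byrow = TRUE,
              dimnames = list(c("bill_1", "bill_2"), c("S1", "S2")))

# %in% returns FALSE for NA, so missing entries are counted as invalid too.
pct_invalid <- 100 * mean(!(toy %in% c(-1, 0, 1)))
pct_invalid
```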

Q2 Expectations

Please articulate your expectations for the correlation matrix given the two-party system.

Q3 Mining with correlations

Please calculate the correlation between the senators’ voting patterns, setting use = "pairwise.complete.obs" in the function cor().
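
A small sketch of what this option does, on invented data: for each pair of senators, cor() uses only the bills where both have a non-missing vote. It only skips NA values, so the sketch assumes the -9999 codes are recoded to NA first. That recoding is one possible choice, not something the assignment prescribes.

```r
# Toy bills-by-senators matrix; values are made up for illustration.
toy <- cbind(
  S1 = c(1, 1, -1, -1, 1),
  S2 = c(1, 1, -1, -9999, 1),
  S3 = c(-1, -1, 1, 1, NA)
)

# Recode the data-collection error code as missing (an assumption, see above).
toy[which(toy == -9999)] <- NA

# Each pairwise correlation uses only the rows where both columns are non-missing.
cor_matrix <- cor(toy, use = "pairwise.complete.obs")
cor_matrix
```

In this toy case S1 and S2 agree on every bill they share and S3 always votes the opposite way, so the off-diagonal entries are either 1 or -1.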

Q4 Machine learning intuition with regression

We will compare 2 feature selection methods for regression, with the goal of predicting a senator’s voting pattern. In both methods, for any target senator, we will select 2 other senators as features to predict the target senator’s vote on any issue.
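
To make the setup concrete, here is a minimal sketch of the regression step alone, on invented data. It fits a linear model for a target senator (S1) using two other senators (S2 and S3) that are picked by hand. How the two features are selected is left out, and taking the sign of the fitted value as the predicted vote is an illustrative assumption.

```r
# Toy bills-by-senators data; values are made up for illustration.
votes_df <- data.frame(
  S1 = c(1, 1, -1, -1, 1, -1, -1),
  S2 = c(1, 1, -1, -1, 1, 1, 1),
  S3 = c(-1, -1, 1, 1, -1, 1, -1)
)

# Predict the target senator S1 from two hand-picked senators.
fit <- lm(S1 ~ S2 + S3, data = votes_df)
coef(fit)

# One possible way to turn fitted values into votes: positive means Yea (1),
# negative means Nay (-1).
predicted <- sign(fitted(fit))
accuracy <- mean(predicted == votes_df$S1)
accuracy
```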

Please answer the following:

Side comment: