Wayne's Github Page

A place to learn about statistics

Homework 2: Simulations and Programming and Simple Least Squares

Goals

Homework 2 encourages you to challenge the reasoning behind the “least squares” choice. In particular, we have shown that, among all possible lines, the regression line minimizes the total squared vertical distance between the data points and the line. In this assignment, we challenge this choice by proposing to minimize a different distance.

Q1 - an alternative distance

The Euclidean distance between an arbitrary point \((u_1, v_1)\) and another arbitrary point \((u_2, v_2)\) can be written as \(\sqrt{(u_1 - u_2)^2 + (v_1 - v_2)^2}\). Given an arbitrary point \((u, v)\) and an arbitrary line \(y = a + b x\), there exists a point \((x^*, a + b x^*)\) on the line that minimizes the Euclidean distance between the point \((u, v)\) and the line. Its x-coordinate is \(x^* = \frac{u + b(v - a)}{1 + b^2}\).
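The expression for \(x^*\) can be checked by minimizing the squared distance from \((u, v)\) to a generic point \((x, a + b x)\) on the line. A short derivation, where \(D\) is a symbol introduced here only for this check:

\[
\begin{aligned}
D(x) &= (u - x)^2 + (v - a - b x)^2 \\
D'(x) &= -2(u - x) - 2b(v - a - b x) = 0 \\
u + b(v - a) &= (1 + b^2)\, x \\
x^* &= \frac{u + b(v - a)}{1 + b^2}
\end{aligned}
\]

Since \(D''(x) = 2 + 2b^2 > 0\), this stationary point is a minimum, and minimizing the squared distance also minimizes the distance itself.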

Please write a function that takes 3 inputs: a vector named par = c(a, b) containing the intercept and slope, a vector u, and a vector v. The function should output the total shortest Euclidean distance between these points and the given line. Both u and v have length n, and the i-th elements of u and v together give the coordinates of the i-th point.

You are NOT allowed to use a for-loop in this function.
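R's arithmetic operators act element-wise on vectors, so a total over n points can usually be computed without an explicit loop. Below is a minimal sketch of that pattern, using the familiar total squared vertical distance rather than the distance asked for in Q1. The name vertical_ss is hypothetical and not part of the assignment.

```r
# Hypothetical helper (not the Q1 answer): total squared vertical distance
# between the points (u[i], v[i]) and the line y = a + b * x.
vertical_ss <- function(par, u, v) {
  a <- par[1]  # intercept
  b <- par[2]  # slope
  stopifnot(length(u) == length(v))
  # `a + b * u` is evaluated for every element of u at once, so the
  # residuals form a length-n vector without any loop.
  residuals <- v - (a + b * u)
  sum(residuals^2)
}

# Three points lying exactly on y = 1 + 2 * x give a total of 0.
vertical_ss(c(1, 2), u = c(0, 1, 2), v = c(1, 3, 5))
```

The same structure (unpack par, compute one vector of per-point quantities, then sum) carries over to other distances.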

Q2 - Translating mathematics to code

We will test out our function on the file hw2_q2.csv.

Q3 - Using programming to find the best line by brute force

Let’s perform a grid search for the best value of a and b by doing the following:
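As a generic illustration of what a grid search looks like, here is a small sketch on toy data. The grid limits, the step size, the toy data, and the stand-in objective obj are all assumptions made for this example and are not the assignment's specification.

```r
# Stand-in objective (assumption): total squared vertical distance.
obj <- function(par, u, v) sum((v - (par[1] + par[2] * u))^2)

# Toy data lying exactly on y = 1 + 2 * x (not hw2_q2.csv).
u <- c(0, 1, 2, 3)
v <- 1 + 2 * u

# Assumed grid: every (a, b) pair with a and b in -3, -2.5, ..., 3.
grid <- expand.grid(a = seq(-3, 3, by = 0.5), b = seq(-3, 3, by = 0.5))

# Evaluate the objective at each row of the grid.
grid$value <- mapply(function(a, b) obj(c(a, b), u, v), grid$a, grid$b)

# The row with the smallest objective value is the brute-force "best" pair.
best <- grid[which.min(grid$value), ]
best
```

A grid search can only return values that lie on the grid, so its accuracy is limited by the step size.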

Q4 - Grid search vs numeric optimization

There are packages that can find the best values of a and b for us without performing a grid search. Let u and v respectively denote the columns in the file hw2_q2.csv, and let’s call the function you wrote in Q1 q1_fun. Then the following code should return the best values for a and b (EDIT: this was previously q2_fun in last year’s version; the intent is clear to your grader):

opt_ab <- optim(par=c(2, -2), q1_fun, u=u, v=v, method="BFGS")
best_a <- opt_ab$par[1]
best_b <- opt_ab$par[2]

Comment: the (2, -2) is just a place for the algorithm to start searching.
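To see the role of the starting value, the sketch below runs optim from two different starting points on simulated data. It uses a stand-in objective, the total squared vertical distance, which is smooth and bowl-shaped, so both runs should end near the same place. The objective obj, the seed, and the second starting point are assumptions for illustration only. Other objectives may behave differently.

```r
set.seed(1)  # arbitrary seed, only for reproducibility
u <- runif(50, -5, 5)
v <- 2 - 3 * u + rnorm(50, sd = 2)

# Stand-in objective (assumption): total squared vertical distance.
obj <- function(par, u, v) sum((v - (par[1] + par[2] * u))^2)

# Same search, two different starting points.
fit1 <- optim(par = c(2, -2), obj, u = u, v = v, method = "BFGS")
fit2 <- optim(par = c(-10, 10), obj, u = u, v = v, method = "BFGS")

fit1$par          # estimated intercept and slope from the first start
fit2$par          # should agree closely with fit1$par
fit1$convergence  # 0 means optim reports successful convergence
```

For this particular objective the result can also be compared with lm(v ~ u)$coefficients, which minimizes the same quantity in closed form.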

Q5 - Uncertainty from different datasets

We can simulate a different dataset for u and v by running (this is the regression model for data generation):

n <- 100
u <- runif(n, -5, 5)
v <- 2 - 3 * u + rnorm(n, sd=2)
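One way to repeat this data generation many times is to wrap it in a function and call replicate. The sketch below is a hedged illustration: simulate_once is a hypothetical helper, ordinary least squares via lm is used as a stand-in estimator, and only 200 repetitions are drawn to keep it quick.

```r
# Hypothetical helper: draw one dataset from the model above and return
# the estimates from a stand-in estimator (ordinary least squares via lm).
simulate_once <- function(n = 100) {
  u <- runif(n, -5, 5)
  v <- 2 - 3 * u + rnorm(n, sd = 2)
  coef(lm(v ~ u))  # named vector with entries "(Intercept)" and "u"
}

set.seed(2)  # arbitrary seed, only for reproducibility
# replicate() returns a 2 x 200 matrix: one column per simulated dataset.
estimates <- replicate(200, simulate_once())

# Histograms of the simulated intercept and slope estimates.
hist(estimates["(Intercept)", ], main = "Intercept estimates", xlab = "intercept")
hist(estimates["u", ], main = "Slope estimates", xlab = "slope")
```

Swapping a different estimator into simulate_once changes what is being studied without changing the simulation loop.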

Please simulate 5000 different datasets and use your function from Q4 to create 2 histograms:

Q6 - Visualizing unbiasedness

In Q5, the values 2 and -3 are respectively the true intercept (\(a_{true}\)) and true slope (\(b_{true}\)) that generated the data. Given any dataset, our current method produces \(\tilde{a}\) and \(\tilde{b}\) as estimates of \(a_{true}\) and \(b_{true}\) respectively. Using your results from Q5, please answer whether you believe \(E(\tilde{a}) = a_{true}\) AND \(E(\tilde{b}) = b_{true}\), i.e. whether the estimates are unbiased. Your answer should be a single Yes/No with a short explanation.

Q7 - Different objectives lead to different solutions

Recall that regression minimizes the total squared residual instead of the total shortest Euclidean distance. Let’s name the coefficients from the regression \(\hat{a}\) and \(\hat{b}\) respectively. You can get the regression coefficients using lm(v ~ u)$coefficients in R.

Please draw a scatter plot of the data in hw2_q2.csv, with u on the x-axis and v on the y-axis, along with 2 lines: a dotted line with intercept \(\tilde{a}\) and slope \(\tilde{b}\), and a solid line with intercept \(\hat{a}\) and slope \(\hat{b}\).

Hint: abline(a = 0, b=1) plots a line with intercept=0 and slope=1 on a plot.
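The line type in abline is controlled by the lty argument (1 for solid, 3 for dotted). A minimal sketch on simulated toy data, not hw2_q2.csv. The dotted line here simply reuses the hint's a = 0, b = 1 as a placeholder for any second line.

```r
set.seed(3)  # arbitrary seed; toy data for illustration only
u <- runif(30, -5, 5)
v <- 2 - 3 * u + rnorm(30, sd = 2)

fit <- lm(v ~ u)
coefs <- fit$coefficients  # first element: intercept, second: slope

plot(u, v)  # scatter plot with u on the x-axis and v on the y-axis
# Solid line (lty = 1) for the least-squares fit.
abline(a = coefs[1], b = coefs[2], lty = 1)
# Dotted line (lty = 3) for a placeholder line with intercept 0 and slope 1.
abline(a = 0, b = 1, lty = 3)
legend("topright", legend = c("least squares", "a = 0, b = 1"), lty = c(1, 3))
```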

Q8 - Comparing methods

Please repeat Q5 but use the regression coefficients \(\hat{a}\) and \(\hat{b}\) instead of \(\tilde{a}\) and \(\tilde{b}\). You should use the same datasets as those generated in Q5.

Q9 - Making a decision

Comparing your plots from Q5 vs Q8, if you want to estimate \(a_{true}\) and \(b_{true}\), do you prefer \(\hat{a}\) and \(\hat{b}\) or \(\tilde{a}\) and \(\tilde{b}\)? Please give a short explanation.

Q10 - Standard errors

Please show your code and the resulting values for the following:

Q11 - Prediction

Please use the dataset from hw2_q2.csv for this problem.

Please show the code for the following steps: