Wayne's Github Page

A place to learn about statistics

Homework 6: Multivariate Regression, Logistic Regression, and DAGs

Q0: Logistic Regression

There is a dataset on Kaggle from Dr. Robert Lyon to help people detect pulsars. Please read the following post, then download its corresponding dataset, pulsar_data_train.csv:

The labels are binary, indicating whether an object is a pulsar or not, and the features are continuous values.

Q0.1 - Interpretation of metrics

Please read the following metrics, which are common in classification, then answer the questions that follow.

For each record, its label, \(Y\), and our prediction for its label, \(\hat{Y}\), will place it in one of the categories below:

\[\begin{array}{c|cc} & \hat{Y} = 1 & \hat{Y} = 0 \\ \hline Y = 1 & A & B \\ Y = 0 & C & D \\ \end{array}\]
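As a minimal sketch of how the four cells A, B, C, and D can be counted, the snippet below tallies them from a vector of labels and a vector of predictions. The function name `confusion_cells` and the toy data are assumptions made for illustration only; they are not part of the assignment.

```python
import numpy as np

def confusion_cells(y_true, y_pred):
    """Return the counts (A, B, C, D) using the layout of the table above."""
    y_true = np.asarray(y_true).astype(bool)
    y_pred = np.asarray(y_pred).astype(bool)
    a = int(np.sum(y_true & y_pred))    # Y = 1 and Y_hat = 1
    b = int(np.sum(y_true & ~y_pred))   # Y = 1 and Y_hat = 0
    c = int(np.sum(~y_true & y_pred))   # Y = 0 and Y_hat = 1
    d = int(np.sum(~y_true & ~y_pred))  # Y = 0 and Y_hat = 0
    return a, b, c, d

# Toy labels and predictions, made up purely for illustration.
y = [1, 1, 1, 0, 0, 0, 0, 1]
y_hat = [1, 0, 1, 0, 1, 0, 0, 1]
cells = confusion_cells(y, y_hat)
print(cells)
```

Every record falls in exactly one cell, so the four counts always sum to the number of records.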

Using the definitions above, here is the list of metrics:

Here is the list of questions:

Q0.2 - Comparing models

A common way to compare models is to examine the trade-off between recall and precision over different thresholds. In our setting, the threshold, \(\alpha\), is the cutoff used to convert \(\hat{p}(X)\) to \(\hat{Y}\), i.e.

\[\hat{Y}|X = \begin{cases} 1 & \text{if } \hat{p}(X) \geq \alpha \\ 0 & \text{otherwise}\\ \end{cases}\]
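The thresholding rule can be sketched as follows. The snippet converts scores to predictions at a cutoff and reports precision and recall, using the usual definitions precision = A / (A + C) and recall = A / (A + B) in the cell notation from Q0.1. The helper name `precision_recall_at`, the toy scores, and the chosen thresholds are assumptions for illustration.

```python
import numpy as np

def precision_recall_at(y_true, p_hat, alpha):
    """Threshold p_hat at alpha, then return (precision, recall)."""
    y_true = np.asarray(y_true).astype(bool)
    y_pred = np.asarray(p_hat) >= alpha          # Y_hat = 1 if p_hat >= alpha
    tp = np.sum(y_true & y_pred)                 # cell A
    fn = np.sum(y_true & ~y_pred)                # cell B
    fp = np.sum(~y_true & y_pred)                # cell C
    # Guard against empty denominators at extreme thresholds.
    precision = tp / (tp + fp) if (tp + fp) > 0 else float("nan")
    recall = tp / (tp + fn) if (tp + fn) > 0 else float("nan")
    return float(precision), float(recall)

# Toy labels and scores, made up purely for illustration.
y = np.array([1, 0, 1, 0, 1, 0])
p_hat = np.array([0.9, 0.8, 0.6, 0.4, 0.3, 0.1])

curve = [(alpha, *precision_recall_at(y, p_hat, alpha))
         for alpha in (0.25, 0.5, 0.75)]
for alpha, precision, recall in curve:
    print(f"alpha={alpha:.2f} precision={precision:.3f} recall={recall:.3f}")
```

Raising the threshold can only shrink the set of positive predictions, so recall never increases as \(\alpha\) grows, while precision may move in either direction.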

For linear regression, let \(\hat{p}(X) = X\hat{\beta}\), and the definition for \(\hat{Y}\) is changed to the function above.
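One way to obtain \(\hat{p}(X) = X\hat{\beta}\) from a linear regression is ordinary least squares on the 0/1 labels. The sketch below uses synthetic features and labels (an assumption; it does not use the pulsar data) and a threshold of 0.5 chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500

# Synthetic stand-in data (assumed for illustration, not the pulsar dataset).
X = rng.normal(size=(n, 2))
true_p = 1 / (1 + np.exp(-(X[:, 0] - 0.5 * X[:, 1])))
y = rng.binomial(1, true_p)

# Add an intercept column, then solve the least-squares problem.
X1 = np.column_stack([np.ones(n), X])
beta_hat, *_ = np.linalg.lstsq(X1, y, rcond=None)

# The linear "probability" score; note it is not constrained to [0, 1].
p_hat = X1 @ beta_hat

alpha = 0.5                         # illustrative threshold
y_hat = (p_hat >= alpha).astype(int)
print("share predicted positive:", y_hat.mean())
print("score range:", p_hat.min(), p_hat.max())
```

Because the linear score is unbounded, thresholds outside \([0, 1]\) can still change the predictions, which is worth keeping in mind when choosing the grid of \(\alpha\) values.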

Please compare the linear regression model against the logistic regression model using the following criteria:

Side comment: The models should be similar, but are they identical in their predictions? The answer can differ depending on the number of features!

Q1 - Wrong models are not always wrong

Let’s create \(n=1000\) samples from the following data generation process.

To give some context, you can imagine that \(X\) is the number of seeds, \(Y\) is the amount of fertilizer, and \(Z\) is the amount of yield.

Q1.1 - DAG

Is \(Z\) a confounder, collider, or mediator in the data generation process?

Q1.2

Please train 3 models (please include intercepts for all of the following):

Please generate 1000 new \((X, Y, Z)\) triplets from the true data generation process. When doing prediction, please assume the new values of \(Z\) and \(X\) are given but \(Y\) is not observable (it is only observable when you’re calculating the error rate). To do prediction with \(model_{true}\), please calculate \(\hat{Y}_{true} = (Z - \hat{\alpha}_x X - \hat{\alpha}_0) / \hat{\alpha}_y\).
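The inversion step for \(model_{true}\) can be sketched as below. The data generation process in `simulate` is a hypothetical placeholder with made-up coefficients, not the one specified in this question; substitute the actual process before drawing any conclusions. Mean squared error is used as the error measure, which is also an assumption.

```python
import numpy as np

rng = np.random.default_rng(1)

def simulate(n):
    # Hypothetical placeholder process (NOT the assignment's): Z depends
    # linearly on X and Y plus noise, with made-up coefficients.
    x = rng.normal(size=n)
    y = rng.normal(size=n)
    z = 1.0 + 2.0 * x + 0.5 * y + rng.normal(size=n)
    return x, y, z

# Fit model_true: regress Z on an intercept, X, and Y.
x, y, z = simulate(1000)
design = np.column_stack([np.ones_like(x), x, y])
(a0, ax, ay), *_ = np.linalg.lstsq(design, z, rcond=None)

# New data: X and Z are treated as given, Y is held out for scoring only.
x_new, y_new, z_new = simulate(1000)
y_hat_true = (z_new - ax * x_new - a0) / ay     # invert the fitted equation

mse_true = float(np.mean((y_new - y_hat_true) ** 2))
print("MSE of inverted model_true:", mse_true)
```

The same held-out triplets can be reused to score any other model, so that all models are compared on identical data.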

Side comment: in our context, imagine I’m a seed dealer who knows the yield of my customers and now wants to estimate the amount a typical customer is spending on fertilizer. Notice how something intuitive can be quite bad.

Q1.3 - Diagnosing issues

From Q1.2, you should have seen that \(model_{true}\) performed worse than the other 2 models.

Side comment: why is \(model_{true}\) doing worse? The answer is in the plot!

Q1.4 - Inference

For \(model_{xz}\), we fitted the model \(Y = \beta_0 + \beta_x X + \beta_z Z + error\). It seems reasonable to use the estimate \(\hat{\alpha}_y^{xz} = \frac{1}{\hat{\beta}_z}\), whereas \(model_{true}\) gives a natural estimate of \(\alpha_y\) directly from its regression model.

Please use simulation to determine whether we should use \(model_{xz}\) or \(model_{true}\) to estimate \(\alpha_y\).
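A simulation of this kind repeats the data generation many times, computes both estimates on each replicate, and compares their bias and spread. The sketch below assumes a hypothetical linear process with made-up coefficients `ALPHA_0`, `ALPHA_X`, and `ALPHA_Y`; these are placeholders, not the assignment's values, and the number of replicates is arbitrary.

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical coefficients, chosen only for illustration.
ALPHA_0, ALPHA_X, ALPHA_Y = 1.0, 2.0, 0.5

def one_replicate(n=1000):
    """Return (alpha_y from model_true, 1 / beta_z from model_xz)."""
    x = rng.normal(size=n)
    y = rng.normal(size=n)
    z = ALPHA_0 + ALPHA_X * x + ALPHA_Y * y + rng.normal(size=n)
    ones = np.ones(n)
    # model_true: regress Z on intercept, X, Y; coefficient on Y is alpha_y.
    coef_true, *_ = np.linalg.lstsq(np.column_stack([ones, x, y]), z, rcond=None)
    # model_xz: regress Y on intercept, X, Z; invert the coefficient on Z.
    coef_xz, *_ = np.linalg.lstsq(np.column_stack([ones, x, z]), y, rcond=None)
    return coef_true[2], 1.0 / coef_xz[2]

draws = np.array([one_replicate() for _ in range(200)])
bias = draws.mean(axis=0) - ALPHA_Y      # average error of each estimator
spread = draws.std(axis=0)               # variability across replicates

print("bias   (model_true, model_xz):", bias)
print("spread (model_true, model_xz):", spread)
```

Looking at both bias and spread matters here, since an estimator can be stable across replicates and still be centered on the wrong value.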

Side comment: in the future, you’ll be using different components of models for different purposes. To understand which is better, simulate it!

Q2 - Marketing

In this question, we will highlight the impact of adding variables to your model.

Imagine the following distribution of variables:

You can imagine this as a simulation in which Subscriptions are a function of someone’s Background, and the number of Clicks is a function of someone’s Background and whether they have been exposed to an Advertisement.

In general, companies often run randomized trials to estimate the impact of marketing campaigns. So it’s not unrealistic to imagine that people exposed to the advertisement are randomly chosen.

Typically, you will not observe or measure someone’s background; you will only know their online activity (e.g. clicks) and whether they were exposed to your advertisement. The goal for many businesses is for people to subscribe to their services.

Question 2.0

Most advertisement campaigns are run by the marketing teams, who are interested in estimating the impact of the campaign.

Question 2.1

Remember that you do not have access to the Background variable!

Question 2.2