Wayne's Github Page

A place to learn about statistics

Prerequisites - Introductory Statistics

You should understand the role statistics plays in the scientific method.

The overall logic is “similar” to proof by contradiction: we assume a statement is true, then we collect data. If the data (reality) is very unlikely under that assumption, we reject the assumption instead. We do this because it is much easier to disprove things than to prove anything correct in practice.

Observations via summary statistics or graphs

When observing data, we often visualize different types of data differently, e.g. histograms for quantitative data vs bar charts for qualitative data. Ultimately, though, we care about the distribution of the data, i.e. the different possible values and their relative frequencies (we also care about absolute frequencies when the least popular category is small).

Some common features of a distribution, and the statistics used to describe them, are:

| Feature | Significance of the feature | Common statistics used to quantify the feature |
|---|---|---|
| Uni-modal vs multi-modal | Multi-modal distributions could imply multiple populations, e.g. children and adults, in the same dataset. Most summary statistics below assume the distribution is uni-modal. | This is often eye-balled from graphs of the data. |
| Location | Where is the data? e.g. are regrets with college majors around 10% (not bad) or 50% (yikes!)? | mean, median, mode |
| Spread | Relative to the location statistic, how spread-out/variable is the data? e.g. 50% regret \(\pm\) 30% is very different from 50% \(\pm\) 5%. | standard deviation, inter-quartile range (IQR), range |
| Skew | Are data values symmetrically distributed about the location statistic? Long tails imply a small fraction of values are vastly different from the majority. | This is often eye-balled in introductory statistics. |

How can there be multiple statistics that serve the same purpose, i.e. different quantities that reflect the same feature of a distribution? This is because different statistics have different properties (which result from their different calculations), and different problems call for different properties.

The most commonly discussed property is sensitivity to extreme values in the distribution. Means are sensitive to values in the tail, whereas medians are much less so. A classic example is income: adding a billionaire to your neighborhood would increase the mean income a lot but barely change the median income.

This sensitivity is not always bad; different problems may call for sensitivity to the tail. For example, housing prices resemble income data in having a long right tail (this is a feature of the data). A policy maker would be interested in the median housing price to understand the typical cost of living (i.e. they would not want their location statistic to be sensitive to outliers), whereas a realtor would be interested in the mean housing price because they earn commission as a fraction of the selling price, and one big sale could cover a year's income (i.e. they want to be sensitive to the presence of rich buyers).
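A minimal sketch of this sensitivity, using made-up neighborhood incomes:

```python
import numpy as np

# Hypothetical neighborhood incomes (made-up numbers, in dollars)
incomes = np.array([40_000, 52_000, 58_000, 61_000, 75_000, 90_000])
print(np.mean(incomes))    # ~62,667
print(np.median(incomes))  # 59,500

# Add one billionaire to the neighborhood
with_billionaire = np.append(incomes, 1_000_000_000)
print(np.mean(with_billionaire))    # jumps to ~143 million
print(np.median(with_billionaire))  # barely moves, to 61,000
```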

The spread statistics tell us how much data falls within a certain range. For example, the fraction of data that is more than \(k\) SDs away from the average is upper-bounded by \(\frac{1}{k^2}\) (see Chebyshev's inequality), so at most \(\frac{1}{2^2}\), i.e. 25%, of the data can be more than 2 SDs away from the average. Similarly, at least 50% of the data lies within one IQR of the median. In some sense, the spread tells us about the quality of the location statistic.
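A quick numerical check of these claims on a made-up, skewed dataset (the log-normal "incomes" below are purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
# A skewed, made-up dataset (log-normal values) just for illustration
data = rng.lognormal(mean=10, sigma=1, size=10_000)

avg, sd = data.mean(), data.std()
for k in (2, 3):
    frac_outside = np.mean(np.abs(data - avg) > k * sd)
    print(f"beyond {k} SDs: {frac_outside:.3f}  (Chebyshev bound: {1/k**2:.3f})")

# At least 50% of the data lies within one IQR of the median
q1, med, q3 = np.percentile(data, [25, 50, 75])
iqr = q3 - q1
print("within one IQR of the median:", np.mean(np.abs(data - med) <= iqr))
```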

Overall, these statistics are helpful because they give us a picture of the distribution of the data without plotting, and they later provide a concrete number against which we can measure change.

Formulating a hypothesis

A hypothesis is an explanation of the observed data that can be challenged/falsified.

For example, if we observe 2 modes in the measured weights of Americans, several explanations could account for the outcome, e.g. the sample could mix two subpopulations such as children and adults.

The typical textbook hypothesis surrounds a causal question like “does vitamin C work?” This is often phrased as a hypothesis like: “Providing a daily dose of X mg of vitamin C does not decrease the average number of sick days taken for the common cold for people aged between 18 and 65. Any difference observed is due to individual differences and the random assignment.”

This statement has several elements: a specific treatment (a daily dose of X mg of vitamin C), a measurable outcome (the average number of sick days), a target population (people aged 18 to 65), and an alternative explanation for any observed difference (individual differences and the random assignment).

It’s important to know that random assignment is only one possible explanation for an observed difference.

These considerations help us formalize our intuition. Specifically, we define “work” by choosing one measurable outcome for health (a simplification). We also provide an alternative explanation for the difference, namely the random assignment (another simplification that ignores other possibilities).

Collecting the relevant data

There are two main questions about data collection in introductory statistics: is the sample representative of the population we care about (sampling bias), and can differences in the outcome be attributed to the treatment rather than to other factors (confounding)?

A useful framework for thinking about sampling bias is the example of a political poll.

| Terminology | Definition | Example via political polling |
|---|---|---|
| Population | Who you want to study | Anyone who can vote in the next election |
| Sampling frame | Who you could possibly contact | Those with a phone number from your country |
| Sample (who responded) | Who you actually contacted and who responded | Those who you sampled and who answered your questions |
| Non-response | Who you contacted but who did not respond | Those who you sampled but who did not pick up or answer your questions |
| Ineligible | Who you sampled and who answered your questions, but who is not part of the population you want to study | Foreigners with a domestic phone number who answered the poll |
| Outside the sampling frame | Who you want to study but cannot contact | Voters without a phone number |

An example of sampling bias: if we contacted the first 1000 phone numbers in our sampling frame, then, since the numbers are likely ordered in a particular fashion (e.g. Alabama first) and states have strong political leanings, our sample would not be representative. Sampling the phone numbers at random turns this possible bias into chance-like error (notice that it is still possible, just unlikely, to randomly draw the first 1000 phone numbers in the sampling frame).
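A small simulation of this idea, with a made-up sampling frame ordered by state (the state-level support numbers are invented):

```python
import numpy as np

rng = np.random.default_rng(1)

# Made-up sampling frame: phone numbers ordered by state, where support
# for a candidate differs by state (50 states, 2000 numbers each).
state_support = rng.uniform(0.3, 0.7, size=50)            # true support per state
frame = np.concatenate([rng.random(2000) < p for p in state_support])
print("true support:", frame.mean().round(3))

# Taking the first 1000 numbers only reaches part of the first state -> biased
print("first 1000: ", frame[:1000].mean().round(3))

# A simple random sample of 1000 numbers is off only by chance-like error
sample = rng.choice(frame, size=1000, replace=False)
print("random 1000:", sample.mean().round(3))
```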

To think about confounding, we must understand the domain and the alternative factors that contribute to differences in the outcome. A common issue when studying diets is whether lifestyle is a confounding factor, i.e. is lifestyle (e.g. exercising vs not) rather than the diet causing the differences we see in the data?

In other words, the reason we cannot simply compare people who are on a diet today vs not and measure their outcomes is that confounding variables could explain both someone's behavior and their health outcome, rather than the diet having an impact on their health. People with healthy lifestyles tend to diet (e.g. no sugary drinks) and also tend to have better health outcomes. Attributing the outcomes to the diet ignores the alternative explanation from lifestyle.

There are two ways to address this confounding issue: random assignment and matching. Random assignment places people into the treated vs control group randomly, regardless of their confounding variables (e.g. healthy or unhealthy lifestyles). Matching, on the other hand, tries to pair people with similar lifestyles and then assigns one to treatment and the other to control.

In practice, matching is difficult because it is hard to know and list the confounding factors upfront. Instead, we try to make sure there are no known confounders and rely on random assignment to address unknown confounders (like genetics) by turning the bias into a chance-like event.
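A minimal sketch of why random assignment helps, assuming a made-up binary confounder ("healthy lifestyle"): randomization ignores the confounder entirely, yet leaves it roughly balanced between the groups.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 1000

# Made-up confounder: healthy lifestyle (True) or not (False)
healthy_lifestyle = rng.random(n) < 0.4

# Random assignment: half treated, half control, shuffled at random,
# without looking at the confounder at all
treated = rng.permutation(np.repeat([True, False], n // 2))

# The confounder ends up roughly balanced between the two groups
print("healthy fraction, treated:", healthy_lifestyle[treated].mean().round(3))
print("healthy fraction, control:", healthy_lifestyle[~treated].mean().round(3))
```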

A nice framework is to think about 3 sources of variability that contribute to the data: the treatment effect, bias (systematic error, e.g. from confounding or a non-representative sample), and chance-like error (e.g. from random sampling, random assignment, and individual differences).

After converting bias to chance-like error, then what?

In an ideal world, the data would only reflect the influence of the treatment and nothing else. In that case we would only need a single measurement from the treatment group and a single measurement from the control group!

This, however, isn’t realistic because there are always individual differences between subjects, random measurement error, etc. Even genetically identical mice can have different outcomes.

To deal with chance-like error in the data, we rely on aggregate statistics over large samples, where the statistics behave more like constants relative to the individual data points. In other words, a statistic computed on a collection of people is more predictable than a measurement from any individual. Intuitively, the individual differences mostly average out, and the treatment is the only thing that consistently differs between the treated and control groups.
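A small simulation of this "averages are more predictable" idea, assuming made-up measurements with an SD of 15 for a single individual:

```python
import numpy as np

rng = np.random.default_rng(3)

# Individual measurements are noisy; averages of many of them are stable.
population_sd = 15
for n in (1, 25, 100, 400):
    sample_means = rng.normal(loc=100, scale=population_sd,
                              size=(10_000, n)).mean(axis=1)
    print(f"n={n:4d}  SD of the average ~ {sample_means.std():.2f}  "
          f"(theory: {population_sd / np.sqrt(n):.2f})")
```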

In classic introductory textbooks, how to calculate the necessary sample size is an important question (especially for grant writing, since recruiting subjects is often a major source of cost). The question then becomes: what sample size will make the statistic “predictable enough”, i.e. make its standard error (SE) small enough?

Small enough is often defined as half of your expected treatment effect, i.e. if you believe a new drug will make you smarter by 3 points, you want your SE to be around 1.5 points. Since introductory statistics only deals with averages, we only have two formulas:

\(SE(\bar{X}) = \frac{\sigma}{\sqrt{n}}\) for a single average, and \(SE(\bar{X}_1 - \bar{X}_2) = \sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}}\) for the difference of two averages.

In the formulas above, \(n\) is the variable we need to solve for, half of the treatment effect (e.g. our 1.5 points) should come from understanding the domain, and \(\sigma\) should come from a pilot study that establishes the general variability and reliability of the measurements.
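A back-of-the-envelope sketch of solving these for \(n\), using made-up numbers (a 3-point effect, \(\sigma = 10\), so a target SE of 1.5) and treating both groups as having the same \(\sigma\):

```python
import numpy as np

sigma = 10      # SD of the outcome, e.g. from a pilot study (made-up)
effect = 3      # expected treatment effect, from domain knowledge (made-up)
target_se = effect / 2

# One average:   SE = sigma / sqrt(n)       -> n = (sigma / SE)^2
n_one_sample = (sigma / target_se) ** 2
# Two averages:  SE = sigma * sqrt(2 / n)   -> n per group = 2 * (sigma / SE)^2
n_per_group = 2 * (sigma / target_se) ** 2

print(np.ceil(n_one_sample))  # ~45 subjects
print(np.ceil(n_per_group))   # ~89 subjects in each group
```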

The calculation above, however, does not factor in statistical power, a concept we’ll cover below. In practice, statistical power is often what sets the sample size.

Where does this “half” come from? It comes from the fact that the average follows a Normal distribution, which is true because of the Central Limit Theorem: averages, given a sufficient sample size, follow a Normal distribution. What counts as a sufficient sample size depends on the data and is not the same across problems.
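A quick illustration of the Central Limit Theorem on a made-up, heavily skewed variable: the raw data is far from Normal, but its averages are nearly symmetric.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)

# 10,000 samples, each of size 50, from a skewed (exponential) variable
skewed = rng.exponential(scale=1.0, size=(10_000, 50))
sample_means = skewed.mean(axis=1)

print("skewness of the raw data: ", stats.skew(skewed.ravel()).round(2))
print("skewness of the averages: ", stats.skew(sample_means).round(2))
```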

A distribution is, again, the possible values and their respective frequency. A Normal distribution has the property:

| \(k\) | \(P(-k\,\text{SD} \leq X - \text{Avg} \leq k\,\text{SD})\) for a Normal random variable \(X\) |
|---|---|
| 1 | \(\approx 0.68\) |
| 2 | \(\approx 0.95\) |
| 3 | \(\approx 0.997\) |

The \(k=2\) case is where the “half” above comes from: we set “Effect Size = 2 * SE”, so “SE = Effect Size / 2”.
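These coverage probabilities can be checked directly, e.g. with scipy:

```python
from scipy import stats

# P(-k SD <= X - Avg <= k SD) for a Normal random variable X
for k in (1, 2, 3):
    coverage = stats.norm.cdf(k) - stats.norm.cdf(-k)
    print(f"k={k}: {coverage:.3f}")
# k=1: 0.683, k=2: 0.954, k=3: 0.997
```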

Analyzing the data

Hopefully you suspected that the work for analyzing the data has already been done in the previous step. You’ve determined the data to collect, how much to collect, and how you will make a decision. This part is mostly about executing those plans and some verification.

In introductory statistics, we rarely talk about the verification that needs to be done, but a simple check is whether the data looks like what you expected. For example, if your psychology experiment has a lot more women than men, you might question the sampling procedure. Failing these checks often means you should re-assess your data collection and delay any decision making.

But assuming the verification goes smoothly, the analysis step often just involves calculating the summary statistics (averages, SDs) and then the z-test-statistic, i.e. the effect normalized by the variability in the data: \(z = \frac{\bar{X}_{treated} - \bar{X}_{control}}{SE}\), where \(SE = \sqrt{\frac{s_{treated}^2}{n_{treated}} + \frac{s_{control}^2}{n_{control}}}\).

With our large samples, the z-test-statistic should follow a Normal(0, 1), i.e. a Normal distribution with mean 0 and SD 1. Therefore, if the z-test-statistic is beyond 2 in absolute value, we would find that surprising under the null hypothesis; specifically, we have encountered an event (seeing our data or something more extreme) that happens less than 5% of the time. This specific probability is called the p-value.
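A sketch of this calculation on made-up treated/control outcomes (e.g. sick days); all numbers below are invented for illustration:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
treated = rng.normal(loc=4.5, scale=2.0, size=200)   # made-up outcomes
control = rng.normal(loc=5.0, scale=2.0, size=200)

# Effect normalized by the variability in the data
se = np.sqrt(treated.var(ddof=1) / len(treated)
             + control.var(ddof=1) / len(control))
z = (treated.mean() - control.mean()) / se

# Two-sided p-value: chance of a statistic at least this extreme under the null
p_value = 2 * stats.norm.cdf(-abs(z))
print(f"z = {z:.2f}, p-value = {p_value:.4f}")
```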

Decision

The p-value can be thought of as a surprise level (the smaller, the more surprised). To make a decision, we also need your risk tolerance for being wrong. There are 2 ways of being wrong: rejecting the null hypothesis when it is actually true (a Type 1 error), and failing to reject the null hypothesis when the alternative is actually true (a Type 2 error).

It turns out that following the hypothesis testing procedure, i.e. rejecting the null when the p-value is less than 5%, ensures your Type 1 error rate is at most 5%. So the p-value is compared to your Type 1 error tolerance, i.e. your significance level \(\alpha\): if the p-value is lower than \(\alpha\), we reject the null hypothesis; otherwise we fail to reject the null hypothesis.

However, most tests fail to reject the null hypothesis. This phrasing, instead of “accepting the null”, is meant to encourage us to think about the other reasons we might fail to reject: the effect may be real but too small for our sample size to detect, or the data may simply be too noisy, i.e. we may lack statistical power.

To control for Type 2 error, we rely on statistical power: the probability of correctly rejecting the null when the alternative hypothesis is correct, P(seeing our z-test-statistic or something more extreme | the alternative hypothesis is true). Calculating this requires us to specify a minimum detectable effect (MDE) from the domain science, which is often the same as or slightly smaller than the expected treatment effect. We then calculate the chance of rejecting the null when the alternative is true.
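A sketch of a power calculation under assumed numbers (an MDE of 3 points, \(\sigma = 10\), a two-sided 5% test, equal group sizes); in practice this is how power, rather than the SE rule of thumb alone, ends up setting the sample size:

```python
import numpy as np
from scipy import stats

sigma, mde, alpha = 10, 3, 0.05   # made-up numbers

for n_per_group in (50, 100, 200, 400):
    se = sigma * np.sqrt(2 / n_per_group)
    z_crit = stats.norm.ppf(1 - alpha / 2)   # ~1.96 for a two-sided 5% test
    # Chance the z-statistic lands beyond the critical value
    # when the true effect equals the MDE
    power = (stats.norm.cdf(mde / se - z_crit)
             + stats.norm.cdf(-mde / se - z_crit))
    print(f"n per group = {n_per_group:4d}: power ~ {power:.2f}")
```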

It’s important to note that the Type 1 and Type 2 errors here refer to error rates over many repetitions of the experiment, not to the individual experiment being conducted. So the hypothesis testing framework is only appropriate for decisions that will be made many, many times.
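A simulation of this long-run guarantee: repeating a null experiment (no true effect, made-up data) many times, roughly 5% of the p-values fall below 0.05.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(6)

# Many repetitions of a null experiment: both groups drawn from the same distribution
n_experiments, n = 10_000, 100
treated = rng.normal(size=(n_experiments, n))
control = rng.normal(size=(n_experiments, n))

z = (treated.mean(axis=1) - control.mean(axis=1)) / np.sqrt(2 / n)
p_values = 2 * stats.norm.cdf(-np.abs(z))
print("fraction of false rejections:", (p_values < 0.05).mean())  # ~0.05
```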

Other topics

Introductory statistics courses often also cover additional topics, such as regression.

When covering regression we won’t talk much about sampling, but you should know that the absolute sample size (not the sample size relative to the population) and proper randomization are the keys to a quality sample.