Wayne's Github Page

A place to learn about statistics

Homework 5

The goal of this homework is to give you some practice with text data.

Warren Buffett is considered one of the most successful investors, and the annual letter he writes for his clients is widely read. The Gates Foundation, one of the largest non-profits, on the other hand publishes an annual letter from Bill and Melinda Gates that touches on a variety of topics.

The letters are in the files buffet_letters.json and gates_letters.json, respectively, on Canvas. Any run of consecutive non-alphanumeric characters has been replaced by a single space character, e.g. “Hello, today I’ll teach.” will become “Hello today I’ll teach.”

Q0: Pre-process the data

At the end of this section, we want a term frequency matrix, i.e. a data frame where each row corresponds to a letter and each column corresponds to a lemmatized token. Here are a few steps we would like you to include:

Please report the dimensions of your data frame at the end and plot a histogram that shows the distribution of word presence across all words, e.g. “be” likely appears in 100% of the letters, while “GEICO” may appear in only 10% of them. The histogram should show these percentages (or counts) across all words.
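As a rough illustration of the target output, here is a minimal Python sketch of a term frequency matrix and the per-token presence used for the histogram. The language, the toy `letters` dictionary, and the helper names are all assumptions made for illustration. The tokens are assumed to be lemmatized already, and the real JSON files are not read here.

```python
from collections import Counter

# Hypothetical toy corpus: each letter is already a list of lemmatized tokens.
letters = {
    "buffett_a": ["be", "geico", "insurance", "be", "float"],
    "buffett_b": ["be", "insurance", "be", "share"],
    "gates_a": ["be", "vaccine", "malaria", "be", "be"],
}

def term_frequency_matrix(docs):
    """Return (vocab, rows); rows[name][j] is the count of vocab[j] in that letter."""
    vocab = sorted({tok for toks in docs.values() for tok in toks})
    rows = {}
    for name, toks in docs.items():
        counts = Counter(toks)
        rows[name] = [counts[tok] for tok in vocab]
    return vocab, rows

def document_presence(vocab, rows):
    """Fraction of letters in which each token appears at least once."""
    n_docs = len(rows)
    return {
        tok: sum(1 for r in rows.values() if r[j] > 0) / n_docs
        for j, tok in enumerate(vocab)
    }

vocab, tf = term_frequency_matrix(letters)
presence = document_presence(vocab, tf)

# Dimension of the matrix: (number of letters, number of distinct tokens).
print((len(tf), len(vocab)))
```

The values in `presence` are what the histogram would summarize: one number per token, giving the share of letters that contain it.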

Q1: Calculating TF-IDF

Please calculate the TF-IDF matrix using the definition of term frequency as “the occurrence of a token normalized by the occurrence of the most frequently used token in the letter”. Please use \(\log\left(\frac{N}{n_t}\right)\) as your inverse document frequency, where \(N\) is the number of letters and \(n_t\) is the number of letters that contain the term \(t\).
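The definition above can be sketched in a few lines of plain Python. This is an illustrative sketch only: the `counts` matrix and `vocab` are made-up toy values, the function name `tfidf` is hypothetical, and the natural logarithm is assumed because the assignment does not name a base.

```python
import math

# Hypothetical raw counts: rows are letters, columns follow `vocab`.
vocab = ["be", "geico", "vaccine"]
counts = [
    [4, 2, 0],
    [6, 0, 3],
    [2, 1, 0],
]

def tfidf(counts):
    """TF = count / (largest count in the letter); IDF = log(N / n_t)."""
    n_docs = len(counts)
    n_terms = len(counts[0])
    # n_t: number of letters that contain term t at least once.
    doc_freq = [sum(1 for row in counts if row[j] > 0) for j in range(n_terms)]
    # Natural log is an assumption; terms that appear nowhere get IDF 0.
    idf = [math.log(n_docs / df) if df > 0 else 0.0 for df in doc_freq]
    weights = []
    for row in counts:
        top = max(row)  # count of the most frequent token in this letter
        weights.append(
            [(c / top) * idf[j] if top > 0 else 0.0 for j, c in enumerate(row)]
        )
    return weights

weights = tfidf(counts)
```

In this toy example “be” appears in every letter, so its IDF is \(\log(3/3) = 0\) and its TF-IDF is zero everywhere, no matter how often it occurs.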

Please calculate, for each token, its maximum TF-IDF value across all letters and assign this to a variable named mtfidf.

Please plot the histogram of mtfidf along with a vertical red line. The red line should mark the largest value within mtfidf that corresponds to a word listed in the stopwords package.

Please also report the number of stopwords you used to calculate the red line value above and the number of tokens whose mtfidf values are less than or equal to the red line value.
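One way to picture the quantities requested above is the following Python sketch. The tiny TF-IDF matrix and the `stopwords` set are hypothetical stand-ins. In practice the list would come from the stopwords package, and the matrix from Q1.

```python
# Hypothetical TF-IDF values: rows are letters, columns follow `vocab`.
vocab = ["the", "have", "business", "geico", "malaria"]
tfidf = [
    [0.00, 0.05, 0.06, 0.40, 0.00],
    [0.00, 0.08, 0.02, 0.00, 0.55],
]

# Stand-in for a real stopword list (assumption for illustration only).
stopwords = {"the", "have", "of"}

# Maximum TF-IDF value of each token across all letters.
mtfidf = {tok: max(row[j] for row in tfidf) for j, tok in enumerate(vocab)}

# Only stopwords that actually occur in the vocabulary can set the red line.
used_stopwords = [tok for tok in vocab if tok in stopwords]
red_line = max(mtfidf[tok] for tok in used_stopwords)

# Tokens at or below the red line, whether or not they are stopwords.
at_or_below = [tok for tok, value in mtfidf.items() if value <= red_line]

print(len(used_stopwords), len(at_or_below))
```

Here “business” falls below the red line without being a listed stopword, which is the kind of token the written discussion is about. For the plot, the values of `mtfidf` would go into a histogram with a vertical line at `red_line`, for example with matplotlib's `hist` and `axvline` if that library is used.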

Please write 1 or 2 paragraphs on how we should handle the tokens that have mtfidf values below the red line value yet are not considered “stopwords” by the package.

Q2: Clustering text

Please choose at least the top 60 tokens based on your TF-IDF matrix above. This is left intentionally open for you to explore different options.
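Since the selection rule is left open, the sketch below shows only one possible option: ranking tokens by their maximum TF-IDF value across letters and keeping the top k columns. The data, the function name `top_tokens`, and the choice of ranking criterion are assumptions. Other criteria, such as the mean or the variance of each column, could be swapped in.

```python
# Hypothetical TF-IDF values: rows are letters, columns follow `vocab`.
vocab = ["be", "geico", "float", "malaria", "vaccine"]
tfidf = [
    [0.0, 0.40, 0.30, 0.00, 0.00],
    [0.0, 0.00, 0.05, 0.55, 0.20],
]

def top_tokens(vocab, matrix, k):
    """Column indices of the k tokens with the largest column maximum, best first."""
    scores = [max(row[j] for row in matrix) for j in range(len(vocab))]
    return sorted(range(len(vocab)), key=lambda j: scores[j], reverse=True)[:k]

keep = top_tokens(vocab, tfidf, k=3)  # the homework would use k >= 60
selected_vocab = [vocab[j] for j in keep]
reduced = [[row[j] for j in keep] for row in tfidf]
```

The `reduced` matrix keeps one row per letter and only the selected token columns, which is the shape a clustering routine would expect.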

Using your TF-IDF matrix, please cluster the letters using k-means clustering and hierarchical clustering with one of the linkage methods, e.g. “single”, “complete”, or “ward.D”.
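To make the k-means step concrete, here is a small dependency-free Python sketch on made-up TF-IDF rows. It is not a substitute for a library routine: the deterministic farthest-point initialization is a simplification chosen so the example is reproducible, and the letter names and values are hypothetical. Hierarchical clustering is not shown.

```python
import math

# Hypothetical TF-IDF rows restricted to four selected tokens.
points = {
    "buffett_a": [0.9, 0.8, 0.0, 0.1],
    "buffett_b": [0.8, 0.9, 0.1, 0.0],
    "gates_a": [0.0, 0.1, 0.9, 0.8],
    "gates_b": [0.1, 0.0, 0.8, 0.9],
}

def kmeans(rows, k, n_iter=20):
    """Plain Lloyd's algorithm with a deterministic farthest-point start."""
    names = list(rows)
    # Start from the first letter, then add the letter farthest from all centres.
    centres = [list(rows[names[0]])]
    while len(centres) < k:
        far = max(names, key=lambda n: min(math.dist(rows[n], c) for c in centres))
        centres.append(list(rows[far]))
    labels = {}
    for _ in range(n_iter):
        # Assignment step: each letter joins its nearest centre.
        new_labels = {
            n: min(range(k), key=lambda i: math.dist(rows[n], centres[i]))
            for n in names
        }
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        # Update step: each centre moves to the mean of its members.
        for i in range(k):
            members = [rows[n] for n in names if labels[n] == i]
            if members:
                centres[i] = [sum(col) / len(members) for col in zip(*members)]
    return labels

labels = kmeans(points, k=2)
print(labels)
```

A quick sanity check on real data is to cross-tabulate the cluster labels against the known author of each letter and confirm that the two sets of letters land in different clusters.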

Please pick one method and try to explain the resulting clustering either visually, algorithmically, or using simple statistics. You can quote news articles or Wikipedia pages to support your findings as well. Your resulting clusters should AT LEAST separate the Gates letters from the Warren Buffett letters.

Q3: Final Project

This section is mostly graded on completion with coherent sentences.