Homework 5
The goal of this homework is to give you some practice with text data.
Warren Buffett is considered one of the most successful investors, and he writes a widely read annual letter to his shareholders. The Gates Foundation, one of the largest non-profits, on the other hand publishes an annual letter from Bill and Melinda Gates that touches on a variety of topics.
The letters are in the files buffet_letters.json and gates_letters.json on Canvas, respectively. Any run of consecutive non-alphanumeric characters has been replaced by a single space, e.g. “Hello, today I’ll teach.” becomes “Hello today I’ll teach.”
Q0: Pre-process the data
At the end of this section, we want a term frequency matrix, i.e. a data frame where each row corresponds to a letter and each column corresponds to a lemmatized token. Here are a few steps we would like you to include (a rough code sketch follows the list):
- Lowercase all of the words
- Replace anything that matches the following regular expressions with the corresponding word:
  - `"([^0-9])(19|20)[0-9]{2}([^0-9])"` → `"YEAR"`
  - `"\\$?[0-9]+[,\\/\\.]?[0-9]*"` → `"NUMBERS"`
- Use the `udpipe` package to lemmatize and tokenize the sentences in the letters. You can get sentences simply by separating the letters by `\n` before using `udpipe_annotate()`.
  - If you have memory issues here, try the `textstem` package instead. You may choose your own tokenization package as well if you go this route.
- (optional) You may perform other pre-processing steps if you believe your computer is hitting memory issues, e.g. dropping any word with 2 or fewer characters, handling links the way we handled years, etc.
- Count the occurrences of each lemmatized token within each letter
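Here is a rough sketch of how these steps might fit together, assuming the JSON files have already been parsed (e.g. with `jsonlite`) into a character vector `letters_raw` with one element per letter; the object names are illustrative, not required:

```r
library(udpipe)

# Lowercase all of the words
letters_txt <- tolower(letters_raw)

# Replace years and numbers using the regular expressions above.
# "\\1YEAR\\3" keeps the captured boundary characters; replacing with a
# bare "YEAR" would drop them.
letters_txt <- gsub("([^0-9])(19|20)[0-9]{2}([^0-9])", "\\1YEAR\\3", letters_txt)
letters_txt <- gsub("\\$?[0-9]+[,\\/\\.]?[0-9]*", "NUMBERS", letters_txt)

# Download and load an English model, then lemmatize and tokenize;
# udpipe_annotate() performs its own sentence splitting as well
ud_model <- udpipe_download_model(language = "english")
ud_model <- udpipe_load_model(ud_model$file_model)
anno <- as.data.frame(udpipe_annotate(ud_model, x = letters_txt,
                                      doc_id = paste0("letter_", seq_along(letters_txt))))

# Count the occurrences of each lemma within each letter and reshape
# into a letters-by-tokens term frequency data frame
counts <- aggregate(token ~ doc_id + lemma, data = anno, FUN = length)
tf <- as.data.frame.matrix(xtabs(token ~ doc_id + lemma, data = counts))
```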
Please report the dimensions of your data frame at the end and print out a histogram showing the distribution of word presence across all words, e.g. “be” likely appears in 100% of the letters while “GEICO” may appear in only 10% of them; the histogram should show these percentages (or counts) across all words.
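Given the term frequency data frame `tf` from the sketch above, the dimensions and the presence histogram could be obtained like this:

```r
dim(tf)  # letters x lemmatized tokens

# Fraction of letters in which each word appears at least once
presence <- colMeans(tf > 0)
hist(presence,
     main = "Word presence across letters",
     xlab = "Fraction of letters containing the word")
```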
Q1: Calculating TF-IDF
Please calculate the TF-IDF matrix using the definition of term frequency as “the occurrence of a token normalized by the occurrence of the most frequently used token in the letter”. Please use \(\log\left(\frac{N}{n_t}\right)\) as your inverse document frequency, where \(N\) is the number of letters and \(n_t\) is the number of letters that contain the term \(t\).
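Under these definitions, a minimal sketch in R (assuming `tf` is the letters-by-tokens count data frame from Q0):

```r
tf_mat <- as.matrix(tf)

# Term frequency: each count normalized by the count of the most
# frequently used token within the same letter
tf_norm <- tf_mat / apply(tf_mat, 1, max)

# Inverse document frequency: log(N / n_t)
N   <- nrow(tf_mat)
n_t <- colSums(tf_mat > 0)
idf <- log(N / n_t)

# Multiply each column (token) by its idf value
tfidf <- sweep(tf_norm, 2, idf, `*`)
```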
Please calculate, for each token, its maximum TF-IDF value across all letters and assign this to a variable named `mtfidf`.
Please plot the histogram of `mtfidf` along with a vertical red line. The red line should be at the largest value within `mtfidf` that corresponds to a word listed in the `stopwords` package.
Please also report the number of stopwords you used to calculate the red line value above and the number of tokens with `mtfidf` values less than or equal to the red line value.
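A sketch of these calculations, assuming the `tfidf` matrix from above and the default English stop-word list (the choice of list is up to you):

```r
library(stopwords)

# Maximum TF-IDF value of each token across all letters
mtfidf <- apply(tfidf, 2, max)

# Largest mtfidf value among tokens that are stopwords
sw <- stopwords("en")            # report length(sw) as the number of stopwords
red_line <- max(mtfidf[names(mtfidf) %in% sw])

hist(mtfidf, main = "Maximum TF-IDF per token", xlab = "mtfidf")
abline(v = red_line, col = "red")

# Number of tokens at or below the red line
sum(mtfidf <= red_line)
```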
Please write 1 or 2 paragraphs on how we should handle the tokens that have `mtfidf` values below the red line value yet are not considered “stopwords” by the package.
Q2: Clustering text
Please choose at least the top 60 tokens based on your TF-IDF matrix above. This is left intentionally open for you to explore different options.
Using your TF-IDF matrix, please cluster the letters using k-means clustering and hierarchical clustering with one of the linkage methods, e.g. “single”, “complete”, or “ward.D”.
Please pick one method and try to explain the resulting clustering either visually, algorithmically, or using simple statistics. You can quote news articles or Wikipedia pages to support your findings as well. Your resulting clusters should AT LEAST separate the Gates letters from the Warren Buffett letters.
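A sketch of both methods, using one possible token-selection rule (the top 60 tokens by `mtfidf`) on the `tfidf` matrix from Q1:

```r
# One of many reasonable selection rules: the 60 tokens with the
# highest maximum TF-IDF values
top_tokens <- names(sort(mtfidf, decreasing = TRUE))[1:60]
X <- tfidf[, top_tokens]

# k-means with, e.g., 2 centers (Gates vs. Buffett)
set.seed(1)
km <- kmeans(X, centers = 2)
km$cluster

# Hierarchical clustering with Ward linkage
hc <- hclust(dist(X), method = "ward.D")
plot(hc)
cutree(hc, k = 2)
```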
Q3: Final Project
This section is mostly graded on completion with coherent sentences.
- Please summarize the goal of your data mining final project using 3 or fewer sentences.
- Please list one or more individuals or entities who would, under ideal circumstances, change their behavior or be willing to pay for the results of your data mining efforts.
- Please produce one visualization using one of the datasets for your final project.
- Please articulate one obvious pattern you expect to see in your data, i.e. something that would not be considered an insight.