Wayne's Github Page

A place to learn about statistics

HW4 - Data Cleaning

Please make sure your R is version 4.0 or higher. In older versions of R, data.frame() converts character columns to factors by default, so you may encounter factors when you construct data frames while others will not. This won’t impact your code significantly.
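As a quick check, the sketch below compares the running R version against 4.0.0 and builds a small data frame with stringsAsFactors set explicitly, which gives character columns on any version. The data frame airports and the variable r_is_v4 are made up for illustration.

```r
# Compare the running R version against 4.0.0.
r_is_v4 <- getRversion() >= "4.0.0"
print(r_is_v4)

# Setting stringsAsFactors explicitly avoids version-dependent behavior.
# (airports is a made-up example, not part of the assignment data.)
airports <- data.frame(code = c("JFK", "ATL"), stringsAsFactors = FALSE)
print(class(airports$code))
```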

Q0 Practice with JOINS

When evaluating where to live after college, you may want to know whether JFK is “more connected” than ATL. One way to quantify “connectedness” is to count the total number of domestic airports that are reachable with a direct or connecting flight.

We can do this using unique_domestic_us_flights_2019.csv on CourseWorks. This file contains all the domestic flights in 2019 where both the origin and the destination are US airports. We are going to restrict ourselves to IATA airport codes for this assignment (e.g. JFK, SFO, etc).
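To illustrate the kind of join involved, here is a minimal sketch on made-up data. It assumes that a "connecting" itinerary means exactly one stop, and the column names origin and dest are hypothetical; the actual CSV may use different names. The helper reachable() is also invented for this example.

```r
# Toy flight table (made-up routes; column names are assumptions).
flights <- data.frame(
  origin = c("JFK", "JFK", "ATL", "SFO", "ORD"),
  dest   = c("ATL", "SFO", "ORD", "SEA", "JFK"),
  stringsAsFactors = FALSE
)

# Rename columns so the first leg's destination and the second leg's
# origin share one key, "via" (the connecting airport).
leg1 <- data.frame(origin = flights$origin, via = flights$dest,
                   stringsAsFactors = FALSE)
leg2 <- data.frame(via = flights$origin, final = flights$dest,
                   stringsAsFactors = FALSE)

# Inner join: every pair of flights that meet at the same airport.
two_leg <- merge(leg1, leg2, by = "via")

# Airports reachable from `airport` directly or with one connection,
# excluding the starting airport itself.
reachable <- function(airport) {
  direct    <- flights$dest[flights$origin == airport]
  connected <- two_leg$final[two_leg$origin == airport]
  setdiff(unique(c(direct, connected)), airport)
}

print(length(reachable("JFK")))
print(length(reachable("ATL")))
```

Renaming the columns before calling merge() keeps the join key unambiguous, which is easier to read than relying on suffixes for a self-join.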

Please show the code for the following:

Q1 Exploring data from Twitter

Tech companies often have ways for people to interact with their data to speed up innovation and discourage unwanted activities. Twitter is one such company, and researchers like to use its data to analyze “popular sentiment”. Twitter users can share their thoughts/findings via short messages, known as tweets, that can be viewed by the public and by people who subscribe to their Twitter feed (a.k.a. followers).

In hw4_twitter.json, you’ll find data collected using the recent search API from Twitter. This contains tweets and additional information known as metadata.

This problem is meant to give you practice exploring hierarchical data types. Use the following code to read in the data, then answer the following questions.

library(jsonlite)
twitter_output <- read_json("hw4_twitter.json")
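read_json() returns nested R lists by default, so the usual list tools apply. The sketch below explores a small hand-built list; its element names (data, meta, public_metrics, and so on) are hypothetical and may not match what is in hw4_twitter.json.

```r
# Hand-built nested list standing in for parsed JSON.
# All element names here are assumptions for illustration only.
toy_output <- list(
  data = list(
    list(id = "1", text = "Hello #rstats",
         public_metrics = list(retweet_count = 3, like_count = 10)),
    list(id = "2", text = "No tags here",
         public_metrics = list(retweet_count = 0, like_count = 2))
  ),
  meta = list(result_count = 2)
)

print(names(toy_output))         # names of the top-level elements
print(length(toy_output$data))   # number of tweets in the toy list
str(toy_output, max.level = 2)   # compact overview of the nesting

# Drill down: [[ ]] selects one element, $ selects by name.
first_like <- toy_output$data[[1]]$public_metrics$like_count
print(first_like)
```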

Q2 Data wrangling

Subsetting and summarizing data is much easier on rectangular data types like data frames, so data wrangling often involves getting hierarchical data into a rectangular format that is Excel friendly.

Using the same data from Q1, please use 2 methods to create a data frame where the rows correspond to the different tweets and the columns correspond (in order) to different features:

The 2 methods should produce the same data frame, but one should use a for-loop while the other should utilize a combination of lapply() and do.call().
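The two patterns can be sketched on a toy list as follows. The list tweets, its fields (id, text, likes), and the helper tweet_to_row() are all made up for illustration; the features requested in this problem will differ.

```r
# Toy list of tweets (made-up fields, not the assignment's features).
tweets <- list(
  list(id = "1", text = "Hello #rstats", likes = 10),
  list(id = "2", text = "No tags here",  likes = 2),
  list(id = "3", text = "More #data",    likes = 7)
)

# Method 1: for-loop filling a preallocated data frame.
n <- length(tweets)
df_loop <- data.frame(id = character(n), text = character(n),
                      likes = numeric(n), stringsAsFactors = FALSE)
for (i in seq_along(tweets)) {
  df_loop$id[i]    <- tweets[[i]]$id
  df_loop$text[i]  <- tweets[[i]]$text
  df_loop$likes[i] <- tweets[[i]]$likes
}

# Method 2: lapply() builds one-row data frames, do.call() stacks them.
# tweet_to_row() is a hypothetical helper for this example.
tweet_to_row <- function(tw) {
  data.frame(id = tw$id, text = tw$text, likes = tw$likes,
             stringsAsFactors = FALSE)
}
rows <- lapply(tweets, tweet_to_row)
df_apply <- do.call(rbind, rows)

# Check that both methods agree.
same <- isTRUE(all.equal(df_loop, df_apply))
print(same)
```

do.call(rbind, rows) is equivalent to calling rbind() with every element of rows as a separate argument, which is why it works for a list of any length.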

What you need to show for this problem:

Hints:

Q3 Why we bother with data wrangling

Please use one of the data frames from Q2 to plot a scatter plot of the retweet count vs the like count, coloring the points according to whether a hashtag was used or not.
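A minimal base R sketch of such a plot, using a made-up data frame, is shown below. The column names retweet_count, like_count, and has_hashtag are assumptions, as are the colors.

```r
# Toy data frame (made-up values and column names).
tweets_df <- data.frame(
  retweet_count = c(0, 3, 12, 1, 7),
  like_count    = c(2, 10, 40, 0, 15),
  has_hashtag   = c(FALSE, TRUE, TRUE, FALSE, TRUE)
)

# One color per point, chosen by the hashtag indicator.
point_col <- ifelse(tweets_df$has_hashtag, "blue", "grey40")

# Retweet count (y) vs like count (x).
plot(tweets_df$like_count, tweets_df$retweet_count,
     col = point_col, pch = 19,
     xlab = "Like count", ylab = "Retweet count",
     main = "Retweets vs likes (toy data)")
legend("topleft", legend = c("Hashtag", "No hashtag"),
       col = c("blue", "grey40"), pch = 19)
```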