
Getting data

So far all the problems assumed that the data came from an existing file. But data can come directly from the internet without ever being saved to a file. In fact, this is the most common way for commercial venues to exchange data.

In practice, there are 2 main ways to obtain data: scraping a website and calling an API. We will cover both below.

Scraping a website

Most people are familiar with websites displaying information to them via their browser. Extracting that information directly off the webpage is the act of scraping. We will only cover the most basic case, where we parse the text of interest out of the HTML.

As an example, we’re going to obtain all the faculty names from Columbia’s Statistics department.

[Image: Columbia Statistics Faculty Names]

Specifically, we will try to get the bolded names like “Wayne T. Lee” and not the “Lee, Wayne T.” versions that are in an awkward order.

Using the inspector tool

As of 2020, most browsers have an inspector tool:

[Image: browser inspector tools]

The Safari “Develop” tab does not show up by default; please Google how to get it to surface.

The inspector tool allows us to see which part of the code corresponds to which part of the webpage. Try moving the cursor around the different parts under “Inspector”; you should see different parts of the webpage being highlighted.

[Image: cursor inspector highlights]

At this point, we normally want to click into the different <div> tags (in Firefox, there are dropdown arrows next to the <div> tags that hide the details within the tag) until we see only the text of interest highlighted. These tags are similar to the commas in a CSV file: they structure the data so the browser knows how to display the content.

[Image: div tag with name]

The information from this navigation will inform how we write the code later.

A quick note about HTML format

In a loose sense, HTML is a data format for webpages (look up XML for the more proper data format).

Here are a few quick facts you need to know:

- Information is wrapped in tags such as <div>, <h3>, or <a>, each with an opening and a matching closing tag (e.g. <h3>...</h3>).
- Tags can carry attributes, e.g. class="cn-entry-name", which are often the easiest handles for locating the content you want.
- Tags nest inside one another, so the whole page forms a tree that we can navigate.
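To make this concrete, here is a toy sketch (peeking ahead to the xml2 package introduced below); the snippet and the class name "name" are made up purely for illustration:

library(xml2)

# a made-up, minimal piece of HTML stored as an R string
toy_html <- '<html><body>
  <h3 class="name"><a href="/wayne">Wayne T. Lee</a></h3>
  <h3 class="name"><a href="/jane">Jane Doe</a></h3>
</body></html>'

toy_tree <- read_html(toy_html)                          # parse the string into a tree
xml_find_all(toy_tree, xpath='//h3[@class="name"]')      # all <h3> tags with class "name"
xml_text(xml_find_all(toy_tree, xpath='//h3[@class="name"]/a'))  # the text inside the nested <a> tags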

Calling the webpage using R with httr

The HTML in the inspector tool can be obtained by using the GET() function in the library httr.

library(httr)
url <- "http://stat.columbia.edu/department-directory/faculty-and-lecturers/"
web_response <- GET(url=url)     # request the same page the browser displays
class(web_response)              # a special "response" object from httr
names(web_response)              # the different pieces stored in the response
web_response$url                 # the page we asked for
web_response$date                # when the request was made
web_response$status_code         # 200 means the request succeeded

What to notice?

- web_response is a special “response” object from httr, not one of R’s usual data types.
- A status code of 200 means the request succeeded.
- None of these fields show the HTML text yet; for that, we need one more step.

To get the same text you saw in the inspector tool, you need to use content() to help convert the binary data into text.

web_content <- content(web_response, as='text')   # convert the raw bytes into text
class(web_content)            # a plain character type now
substr(web_content, 1, 100)   # the first 100 characters of the HTML
length(web_content)           # one single string...
nchar(web_content)            # ...with a very large number of characters

Parsing the webpage using xml2

The converted text is extremely long despite being a single string. Its format is HTML, which is a close cousin of XML. Here we will need the package xml2 to help us parse this string.

library(xml2)
html_tree <- read_html(web_content)   # parse the long string into a navigable tree
class(html_tree)

To get the names we will run the following:

# the XPath '//h3[@class="cn-entry-name"]' selects every <h3> tag, anywhere in the
# page, whose class attribute equals "cn-entry-name" (the tag we found in the inspector)
name_node <- xml_find_all(html_tree, xpath='//h3[@class="cn-entry-name"]')
length(name_node)     # one node per faculty member
name_node[[1]]        # the first match

To complete the task:

name_link_node <- xml_children(name_node)   # the <a> tags nested inside each <h3>
length(name_link_node)                      # how many child tags were found
all_names <- xml_text(name_link_node)       # grab the text inside those tags
class(all_names)                            # back to a plain character vector
all_names
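Side note: if the page keeps the structure above, the same names can likely be pulled in one step by pointing the XPath directly at the link inside each heading. This is only a sketch; like any scraping code, it breaks if the site is redesigned.

all_names_alt <- xml_text(
    xml_find_all(html_tree, xpath='//h3[@class="cn-entry-name"]/a'))
identical(all_names, all_names_alt)   # expected to be TRUE if each <h3> only wraps one <a>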

Side comment: scraping takes a lot of trial-and-error, so do not worry if this took a while to do!

Exercise

From the special classes in httr and xml2 to our native R data types

Special data types are common in object-oriented programming languages (which R supports), so sometimes you’ll see odd data types pop up.

These are created often to facilitate operations specific to the type of data at hand (e.g. parsing responses from websites, navigating XML graphs). Packages/libraries will often have custom functions that interact with these special data types (e.g. content(), xml_children(), etc).

My general recommendation is to:

- use the package’s own functions (e.g. content(), xml_text()) to pull out the piece of data you actually need, then
- convert the result into native R types (character vectors, lists, data frames) as early as possible and continue the analysis with the usual tools.
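For example, once xml_text() hands us a plain character vector, we are back in familiar territory (a small sketch reusing all_names from above):

class(all_names)                             # "character" -- a native R type
faculty_df <- data.frame(name = all_names)   # the usual tools now apply
head(faculty_df)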

Side comment: custom data types and functions are common in something called “object-oriented programming” (OOP), as opposed to functional programming (where the data types are limited but the number of functions is vast). You should take a CS course if you’re curious about these 2 philosophies.

The “Network” under Inspector and why scraping is discouraged

Return to the inspector tool; you should see a tab called “Network”.

[Image: Network screen shot]

Take-home messages:

- Loading a single webpage triggers many requests behind the scenes; the “Network” tab lists every one of them.
- Some of those requests return data in a structured form rather than formatted HTML, which hints at a better way to get data (the APIs below).

Some context on APIs

Here are some of the flaws of scraping:

- The code depends entirely on the page’s layout: if the website is redesigned, the scraper breaks.
- You download and parse a lot of formatting you do not care about just to extract a small amount of data.
- Repeated scraping puts unnecessary load on the website, and some websites explicitly discourage or prohibit it.

One way to solve this is to leverage the company’s APIs:

- APIs are URLs designed for programs rather than browsers, so they return data in a structured format (usually JSON) instead of HTML meant for display.
- The format is documented and relatively stable, so your code is less likely to break.
- The company can manage access explicitly, e.g. via API keys and rate limits.

Elements in an API call

Calling an API can be done via the GET() function, just as in scraping! The common difference is that we will often pass multiple arguments along with our call to specify who we are and what we want.

As an example, we will use newsapi.org, a website that aggregates headlines from different papers. It gives you a few “free” calls but will block you after a certain threshold (a freemium business model).

If you look at the documentation for NewsAPI.org, there are 3 different URLs, under “Endpoints”, for 3 different applications: top headlines, everything, and sources. If you click on “Top headlines”, you’ll see a link, https://newsapi.org/v2/top-headlines?country=us&apiKey=API_KEY, after the word “GET”.

[Image: newsapi endpoint screen shot]

On the page above, the elements to make an API call are:

- The endpoint URL, https://newsapi.org/v2/top-headlines, which says what type of data you want (here, top headlines).
- The query parameters, e.g. country=us, which narrow down the request.
- The API key (apiKey=...), which identifies who is making the call.

Using httr to call the APIs

Now that we know the elements, we will use httr to get the data, similar to the example shown on the NewsAPI.org Top Headlines endpoint.

library(httr)
url <- "https://newsapi.org/v2/top-headlines"   # the "Top headlines" endpoint
params <- list(
    country="us",                 # query parameter narrowing the request
    apiKey="THE_KEY_YOU_GET")     # the key you receive after registering
response <- GET(url=url,
                query=params)     # httr appends the parameters after a "?" in the URL
response$url                      # notice the parameters in the final URL
response$status_code

newsapi_data <- content(response) # the JSON response is parsed into a list
class(newsapi_data)
names(newsapi_data)

What to notice?

- response$url shows that the query parameters were appended to the endpoint URL after a “?”, just like the link in the documentation.
- A status code of 200 again means the call succeeded.
- content() now returns a list rather than one long string: the API responds with JSON, which httr parses into nested R lists.
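As a sketch of getting back to native R types, the headline titles can likely be pulled out of the nested list as below. The field names articles and title come from NewsAPI.org’s documented JSON format, so double-check them against what names() actually shows you:

# each element of newsapi_data$articles should itself be a list with a "title"
titles <- sapply(newsapi_data$articles, function(article) article$title)
head(titles)
headline_df <- data.frame(title = titles)   # a familiar data frame to work with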

Common errors in calling APIs

- A status code other than 200 usually means something went wrong: 401 typically indicates a missing or invalid apiKey, and 429 indicates you have made too many calls (e.g. passing the freemium threshold).
- Misspelled parameter names or values (e.g. the wrong country code) often return an error message inside the content rather than the data you expected.
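A minimal defensive pattern using httr’s helpers http_error() and http_status() (the meaning of any specific status code is up to the API’s documentation):

if (http_error(response)) {                 # TRUE for any 4xx or 5xx status code
    print(http_status(response)$message)    # a human-readable summary of what went wrong
} else {
    newsapi_data <- content(response)       # only parse the content on success
}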

Wrapping up

Hopefully you see how you could get more data from different websites in a programmatic fashion, i.e. without clicking on Download manually.

I recommend trying to get data via different APIs from different companies. Here are a few places to start: