
APIs and Scraping

APIs

Although many websites let you download data with a single click in a browser, automatically querying data without any clicks requires some special tooling.

This requires us to understand how to interact with APIs (Application Programming Interfaces). APIs allow people to interact programmatically with services provided by companies and government agencies, a critical component for automation.

The dominant Python package for interacting with APIs is called requests. But before we introduce requests, we need to cover some basics about APIs.

Elements of an API

The most common task is to query data from APIs. To do this, your program needs to know: where to get the data and how to specify the data you want.

As an example, we will use newsapi.org, a website that aggregates headlines from different newspapers. It gives you a few free calls but will block you after a certain threshold (a freemium business model).

If you look at the documentation for NewsAPI.org, there are 3 different URLs, under “Endpoints”, for 3 different applications: top headlines, everything, and sources. If you click on “Top headlines”, you’ll see a link, https://newsapi.org/v2/top-headlines?country=us&apiKey=API_KEY, after the word “GET”.
[Image: NewsAPI endpoint documentation]

In this link, there are several components: the endpoint, https://newsapi.org/v2/top-headlines, which tells the service you want the top headlines; the query parameters after the ?, here country=us (multiple parameters are separated by &); and the API key, apiKey=API_KEY, which identifies who is making the call and is how NewsAPI enforces its free-call threshold.
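If you want to pull these components apart programmatically, Python's standard urllib.parse module can do it. This is just an illustration; requests will assemble such links for us below.

from urllib.parse import urlparse, parse_qs

link = "https://newsapi.org/v2/top-headlines?country=us&apiKey=API_KEY"
parts = urlparse(link)
parts.netloc           # 'newsapi.org', the server hosting the API
parts.path             # '/v2/top-headlines', the endpoint
parse_qs(parts.query)  # {'country': ['us'], 'apiKey': ['API_KEY']}, the parameters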

Using requests to call the APIs

Now that we know the elements, we can use requests.get() to get the data, similar to the example shown on the NewsAPI.org Top Headlines endpoint.

import requests

URL = "https://newsapi.org/v2/top-headlines"
params = {
    'country': 'us',
    'apiKey': 'THE_KEY_YOU_GET'}  # replace with the key you receive after registering
response = requests.get(url=URL, params=params)

response.url          # the full link requests assembled from URL and params
response.status_code  # 200 means the call succeeded

newsapi_data = response.json()  # parse the JSON payload into Python objects
type(newsapi_data)
newsapi_data.keys()

What to notice?

response.url shows that requests assembled the endpoint and the parameters into the same kind of link we saw in the documentation. A status_code of 200 means the call succeeded; other codes signal a problem (more on this below). Finally, response.json() converts the JSON payload into familiar Python objects, here a dictionary whose keys() are the top-level fields of the response.
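According to NewsAPI's documentation, one of those top-level fields is articles, a list of dictionaries that each carry fields such as title. Assuming that structure, a quick sketch of pulling out the headline titles:

titles = [article['title'] for article in newsapi_data['articles']]
titles[:5]  # peek at the first few headlines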

Common errors in calling APIs
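One error you will almost certainly hit is a non-200 status code, for example when the API key is missing, mistyped, or has exceeded the free-call threshold mentioned above. A minimal sketch of checking for this, reusing the URL and params from before:

response = requests.get(url=URL, params=params)
if response.status_code != 200:
    print("Request failed with status", response.status_code)
    print(response.text)        # the service usually explains what went wrong
response.raise_for_status()     # or turn any failed call into a Python exception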

Scraping

Most people are familiar with websites displaying information to them via their browser. Extracting that information directly off the webpage is called scraping. We will only cover the most basic case, where we parse the text out of the HTML.

As an example, we’re going to obtain all of the faculty names from Columbia’s Statistics department directory.
[Image: Columbia Statistics faculty names]

Specifically, we will try to get the bolded names like “Wayne T. Lee” rather than the awkwardly ordered “Lee, Wayne T.”

Using the inspector tool

As of 2020, most browsers have an inspector tool:
[Image: browser inspector tools]

The Safari “Develop” tab does not show up by default; please Google how to get it to surface.

The inspector tool allows us to see which part of the code corresponds to which part of the webpage. Try moving the cursor around different parts under “Inspector”; you should see different parts of the webpage being highlighted.
[Image: cursor highlighting elements in the inspector]

At this point, we normally want to click into the different <div> tags (in Firefox, there are dropdown arrows next to the <div> tags that hide the details within the tag) until only the text of interest is highlighted. These tags are similar to the commas in a CSV file: they structure the data so the browser knows how to display the content.
[Image: div tag with the name]

What we find from this navigation will inform how we write the code later.

A quick note about HTML format

In a loose sense, HTML is a data format that describes a webpage’s structure (look up XML for the more general data format), i.e. headings, sidebars, tables, etc.

Here are a few quick facts you need to know: content is wrapped in tags that open and close, e.g. <h3> ... </h3>; tags can be nested inside one another, which gives the page its tree structure; and tags can carry attributes such as class or id, which are often the easiest handles for pinpointing the content you want (see the schematic fragment below).
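For instance, the faculty page contains markup roughly like the following. This is a simplified sketch rather than the page’s exact HTML, but the cn-entry-name class is the one we will search for later.

<div class="cn-entry">
  <h3 class="cn-entry-name">Wayne T. Lee</h3>
</div>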

Calling the webpage using requests in Python

The HTML in the inspector tool can be obtained by using the requests.get() function.

import requests
URL = "http://stat.columbia.edu/department-directory/faculty-and-lecturers/"
web_response = requests.get(url=URL)
type(web_response)
print(web_response.url)
print(web_response.headers)
print(web_response.status_code)

What to notice?

The call returns the same Response class as before, and the 200 status code again means the request succeeded. The headers (in particular Content-Type) show that this time the content is HTML rather than JSON.

To get the same text you saw in the inspector tool, you need web_response.text, which decodes the raw bytes of the response into a string.

page_html = web_response.text
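A quick sanity check is to look at how long this string is and peek at the beginning of it:

len(page_html)          # the whole page is one very long string
print(page_html[:300])  # the first few hundred characters of the HTML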

Parsing the webpage using beautifulsoup4

The text is an extremely long string. Its format is HTML, which is a close cousin of XML. Here we will need the package beautifulsoup4 to help us parse this string.

To install beautifulsoup4, you may need to specify a conda channel (here we use conda-forge, a relatively popular one). This command should be typed into the Terminal/Command Line/Anaconda Prompt.

conda install -c conda-forge beautifulsoup4

When importing the package, it is oddly named bs4 instead.

from bs4 import BeautifulSoup
html_tree = BeautifulSoup(web_response.text, 'html.parser')
type(html_tree)

Again we have a special class of data. The HTML data is often thought of as a “tree”, where the outermost layer (e.g. <html> ... </html> in the page) is the root node and you can dig into its child nodes.
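For example, assuming the page has the usual <title> tag, you can walk down the tree like this:

html_tree.title                  # the first <title> tag in the tree
html_tree.title.get_text()       # just the text inside that tag
len(html_tree.find_all("div"))   # count the <div> tags anywhere in the tree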

To get the names we will run the following:

name_node = html_tree.find_all("h3", attrs={'class': "cn-entry-name"})  # every <h3> tag with class "cn-entry-name"
len(name_node)   # how many matches were found
name_node[0]     # inspect the first match

To complete the task:

names = [h3_node.get_text() for h3_node in name_node]
len(names)
print(names[:10])
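If the extracted strings carry stray whitespace or newlines, h3_node.get_text(strip=True) (or Python’s own .strip()) will clean them up.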

Side comment: scraping takes a lot of trial and error, so do not worry if this took a while!

The “Network” tab under the Inspector and why scraping is discouraged

Returning to the inspector tool, you should see a tab called “Network”.

[Image: Network tab screenshot]

Take-home messages:

Some context on scraping

Here are some things to know about scraping:

Scheduling jobs with cron

Many types of data collection require you to run a particular task repeatedly over time. In our context, you may want to get the news articles from newsapi.org every day. You then need to schedule your Python code to run at a particular time every day.

The most basic tool for this is called cron, which comes with Mac and Linux (Windows has a similar built-in Task Scheduler).

The overall flow is to edit a special file called the crontab (typically with the command crontab -e). In this file you specify the schedule, the program to run (e.g. Python), and the script to pass to it. You need to be careful with paths and permissions because cron does not run your code in the same environment as your interactive login (for example, it uses a much more minimal PATH), so a script that works at your terminal may still fail under cron.
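As a sketch, a crontab entry that fetches the news every morning at 7:00 could look like the line below; the five leading fields are minute, hour, day of month, month, and day of week, and both paths are placeholders to replace with your own interpreter and script.

# minute hour day-of-month month day-of-week   command
0 7 * * * /usr/bin/python3 /Users/wayne/get_news.py >> /Users/wayne/news_log.txt 2>&1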