
Loading and writing data from and to files

Data is often stored in different file formats, and you need different parsers to load them.

Sometimes the data may not be properly formatted, and it will need to be read in as plain text first.

Reading and writing files

The most important thing is to understand what format is best for your data.

Common plain text formats are CSV (comma-separated values), TSV (tab-separated values), and JSON (JavaScript Object Notation).

Some fields have created standards around their data and therefore have their own data formats. Shapefiles for geospatial data are one such example.

Some organizations, for efficiency and privacy, may have their own formats that are not stored in plain text. Protobuf from Google is a popular example of this, where one needs a separate schema file to parse (deserialize) the data stream.

A package will likely exist

Whatever the format, you’ll likely need a package to help you read it.

With lower-level functions, it’s important to know that reading data from and writing data to files often involves two steps:

  1. Creating a handle to access the file in read or write mode.
  2. Parsing the data stream from the file (read) or serializing the data into the file’s format (write).

Serialization just means storing the data in the file’s format, e.g. strings are surrounded by quotes ("), values are separated by commas (,), etc. There are more complicated versions when the data is not stored in plain text.
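
As a small illustration (the dictionary demo_record below is made up for this sketch), json.dumps serializes a Python object into a JSON-formatted string, and json.loads parses that string back into a Python object.

import json

# A made-up record for illustration
demo_record = {"name": "Ada", "scores": [90, 85]}

# Serialize: Python object -> JSON-formatted text
serialized = json.dumps(demo_record)
print(serialized)   # {"name": "Ada", "scores": [90, 85]}

# Deserialize (parse): JSON-formatted text -> Python object
parsed = json.loads(serialized)
print(parsed == demo_record)   # True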

Here’s an example of how to read/write a JSON file.

import json

# Step 1: create a handle to the file in read mode
# Step 2: parse (deserialize) the JSON data stream
with open('demo.json', 'r') as demo_handle:
    demo_json = json.load(demo_handle)

# Open a handle in write mode, then serialize the data back out
with open('demo_copy.json', 'w') as copy_handle:
    json.dump(demo_json, copy_handle)

Here’s an example of how to read/write a CSV file.

import pandas as pd

# read_csv opens the file and parses the CSV into a DataFrame in one call
demo_df = pd.read_csv("demo.csv")
# index=False avoids writing the DataFrame's row index as an extra column
demo_df.to_csv("demo_copy1.csv", index=False)

Another approach is to use numpy, which we will discuss in more detail soon.

import numpy as np

# loadtxt expects purely numeric data; add skiprows=1 if the file has a header row
demo_array = np.loadtxt(fname="demo.csv", delimiter=",")
np.savetxt("demo_copy.csv", demo_array, delimiter=",")

Reading in plain text data that may be poorly formatted

Every now and then, we’ll encounter data that is poorly formatted. For example, the CSV file below from the World Bank leads to a parsing error in most programs.

import pandas as pd
df = pd.read_csv("API_8_DS2_EN_csv_v2_3471645.csv")
# ParserError: Error tokenizing data. C error: Expected 3 fields in line 5, saw 66

To debug the problem, we could open the file with a text editor (e.g. Notepad on Windows) or let Python read the file line by line to inspect the issue.

# Read the raw lines so we can inspect the file's structure
with open("API_8_DS2_EN_csv_v2_3471645.csv", "r") as f:
    txt = f.readlines()

# Show the first 5 lines
txt[:5]

Reading the first 5 lines reveals that the first 4 lines contain metadata (the data title and last-updated date), which breaks the usual CSV format. Knowing this, we can tell our usual functions to skip the first 4 lines.

df = pd.read_csv("API_8_DS2_EN_csv_v2_3471645.csv", skiprows=4)

In the worst cases, we would need to parse the data line by line ourselves.
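
Here is a minimal sketch of what that can look like, assuming (as a simplification) that after the 4 metadata lines each line is simple comma-separated text with no quoted fields containing commas; the list name records is made up for illustration.

# A minimal sketch of manual, line-by-line parsing.
# Assumption: after the 4 metadata lines, each line is simple
# comma-separated text with no quoted fields containing commas.
records = []
with open("API_8_DS2_EN_csv_v2_3471645.csv", "r") as f:
    for i, line in enumerate(f):
        if i < 4:
            continue  # skip the metadata lines
        fields = line.rstrip("\n").split(",")  # split the line on commas
        records.append(fields)

In practice, the csv module from the standard library is a more robust option for this kind of manual parsing because it handles quoting rules for you.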