
Pandas - Handling tabular data with mixed data types

numpy.array forces every value to share a single data type, which creates issues when the data contains both numeric and categorical information (e.g. sex, occupation, etc.).

Although all algorithms ultimately only work with numeric values, e.g. by converting categories into counts or a collection of 0-1 columns, the data is rarely encoded in that format. For example, the variable gender is commonly coded as M, F, Other, or NA. Most algorithms will then create multiple columns/variables like gender_is_M, gender_is_F, and gender_is_Other, where each is 0 or 1 depending on the actual value of gender (gender_is_NA is often not created for statistical reasons).
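
For a concrete picture of what this encoding looks like, here is a minimal sketch using pandas' get_dummies (pandas itself is introduced below); the gender values are made up purely for illustration.

import pandas as pd

# a made-up gender variable, purely for illustration
gender = pd.Series(['M', 'F', 'Other', 'M', None], name='gender')
# one indicator column per observed category: gender_is_F, gender_is_M, gender_is_Other;
# the row with the missing value is all 0 by default
pd.get_dummies(gender, prefix='gender_is')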

The 0-1 flags, however, are hard to explore since users usually handle the values jointly. This becomes very awkward when there are multiple categorical variables with different numbers of possible values. Given these use cases, a new data type was necessary, since numpy’s speed depends greatly on the single-data-type constraint.

pandas is the package that provides the pandas.DataFrame data type, which solves this issue! To introduce the data frame, we’ll use train.csv from the Titanic dataset on Kaggle.com as an example.

import pandas as pd

df = pd.read_csv("train.csv")

Pandas data frame Attributes

A pandas.DataFrame is similar to a numpy.array in that it often has 2 dimensions.
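
For example, here are a few attributes worth inspecting, assuming df is the Titanic data frame read in above:

df.shape     # (number of rows, number of columns), just like numpy
df.columns   # the column labels, e.g. Survived, Sex, Age, Fare
df.dtypes    # the data type of each column; unlike numpy, these can differ across columns
df.index     # the row labels, by default 0, 1, 2, ...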

Subsetting a Data Frame

Similar to numpy, we generally subset data frames by columns and rows. However, since data frames have additional features (such as column names and row labels), they need to be subsetted using dedicated methods, and the behavior depends on the type of data you index with.
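
Here is a small sketch of the most common subsetting patterns, again assuming df is the Titanic data frame:

df['Age']                    # a single column by name, returned as a pandas.Series
df[['Sex', 'Age', 'Fare']]   # several columns by name, returned as a smaller DataFrame
df.loc[0:4, ['Sex', 'Age']]  # rows and columns by label with .loc
df.iloc[0:5, 0:3]            # rows and columns by integer position with .iloc
df[df['Age'] > 60]           # rows that satisfy a boolean condition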

Each column is a pandas.Series

We hinted that each column in a data frame is a pandas.Series. These work similarly to numpy.array in that all values share a data type, and they come with methods for calculating summary statistics.

df.Fare / 10            # returns a pandas.Series with each value divided by 10
df.Fare.mean()          # calculates the mean
df.Age.isna().mean()    # calculates the frequency of NaN values
df.Sex.value_counts()   # calculates the frequency of each possible value
df.Fare.argmax()        # returns the row number with the largest value
df.Age.fillna(df.Age.mean()) # Replaces all the NaN values with the average

Many of these methods can also be called on the entire data frame, although this is generally discouraged because data frames often have mixed data types.
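
For example, a small sketch that restricts the calculation to numeric columns to sidestep the mixed-type issue:

df[['Age', 'Fare']].mean()    # column-wise means for a numeric subset
df.mean(numeric_only=True)    # or ask pandas to skip the non-numeric columns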

Creating a Data Frame

There are 2 main ways you’ll get a pandas data frame: by reading one in from a file (as we did with pd.read_csv above) or by building one from data already in Python, e.g. a list of dictionaries or the result of merging existing data frames.

Merging/Joining 2 Datasets

Sometimes we have information spread across 2 different datasets but we want to combine them, e.g. combining the geographic information of a city with the health statistics of that city allows us to plot the health statistics on a map for easy comparison.

This operation is often called a “join”, a term from SQL. The records that share a key across the two datasets will be merged. In pandas, the on parameter lets us indicate which column(s) the datasets will be merged on. To demonstrate the different merges, we will use the following example; notice that city is shared across both datasets but the values under city do not fully overlap.

a = pd.DataFrame([{'city': 'new york', 'lon': 100},
                  {'city': 'san francisco', 'lon': 120}])
b = pd.DataFrame([{'hospital': 'A', 'city': 'new york', 'beds': 90},
                  {'hospital': 'B', 'city': 'new york', 'beds': 20},
                  {'hospital': 'A', 'city': 'austin', 'beds': 210}])
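
A short sketch of the main how= options; 'inner' keeps only the keys present in both datasets, while 'left', 'right', and 'outer' keep the keys from one or both sides and fill in NaN where information is missing:

a.merge(b, on='city', how='inner')  # only new york, the one city present in both a and b
a.merge(b, on='city', how='left')   # every city in a; san francisco gets NaN for the hospital columns
a.merge(b, on='city', how='right')  # every city in b; austin gets NaN for lon
a.merge(b, on='city', how='outer')  # every city in either dataset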

Applying a function across rows or columns

Similar to numpy, we can “apply” the same function across multiple rows or columns without explicitly calling a for-loop.
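
For instance, a minimal sketch using the Titanic data frame from above; axis=0 (the default) applies the function to each column, while axis=1 applies it to each row:

df[['Age', 'Fare']].apply(lambda col: col.max() - col.min())  # one value per column
df.apply(lambda row: row['SibSp'] + row['Parch'], axis=1)     # one value per row, e.g. family size aboard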

Grouping

A rather handy feature of pandas is its ability to partition the data frame into groups according to the values of certain columns. For example, we can group the data by Sex and Pclass and then perform calculations within each group.

df_grp = df.groupby(['Sex', 'Pclass'])
df_grp.Survived.mean()
# Sex     Pclass
# female  1         0.968085
#         2         0.921053
#         3         0.500000
# male    1         0.368852
#         2         0.157407
#         3         0.135447

For more custom calculations, df_grp.groups is an attribute that is a dictionary containing the group label as the key and the index of all records belonging to the group as the value.

for grp, inds in df_grp.groups.items():
    # Using the groupby object method
    surv_rate = df_grp.get_group(grp).loc[:, 'Survived'].mean()
    # Using the indices
    surv_rate2 = round(df.loc[inds, 'Survived'].mean() * 100, 2)

    print('Survival rate for group with Sex {} and Pclass {} is {}%'.format(
        grp[0], grp[1], round(100 * surv_rate, 2)))
    print(f'Survival rate for group with Sex {grp[0]} and Pclass {grp[1]} is {surv_rate2}%')

Class specific methods

There are many data-type-specific methods in pandas that help with the readability of your code.

Let’s use the a and b datasets from the merging example above.

Appending .str to a Series that holds strings allows you to use many different string methods.

# To see if the substring 'new' or 'san' appears in each city name
m = a.merge(b, on='city', how='outer')
print(m.city.str.lower())
print(m.city.str.lower().str.contains('new|san'))

# notice that `.contains()` is NOT a standard string method;
# the line below would raise an AttributeError (plain Python uses `'ll' in demo_str` instead)
demo_str = 'hello'
# demo_str.contains('ll')

Appending .dt to a Series with a datetime dtype allows you to use many different datetime methods.

m['date'] = pd.to_datetime(['2020-01-01', '2020-02-01', '2020-03-01', '2020-04-01'])
print(m.date.dt.month)
print(m.date.dt.day_of_week)