Pandas - Handling tabular data with mixed data types
numpy.array forces every value to share the same data type, which creates issues when the data contains both numeric and categorical information (e.g. sex, occupation, etc.).
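A quick illustration of this coercion (a minimal sketch; the exact dtype width may vary by numpy version):

import numpy as np

# mixing a string with a number forces everything into a single dtype,
# here a fixed-width unicode string
np.array(['M', 1.5])  # array(['M', '1.5'], dtype='<U32')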
Although all algorithms ultimately work only with numeric values, e.g. by converting categories into counts or a collection of 0-1 columns, the data is rarely encoded in that format. For example, the variable gender is commonly coded as M, F, Other, or NA. Most algorithms will then create multiple columns/variables like gender_is_M, gender_is_F, and gender_is_Other, where each is 0 or 1 depending on the actual value of gender (gender_is_NA is often not created for statistical reasons).
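As a preview (pandas is introduced below), one way to produce such 0-1 columns is pd.get_dummies(); the gender values here are made up for illustration:

import pandas as pd

gender = pd.Series(['M', 'F', 'Other', 'M'], name='gender')
# one 0-1 column per category: gender_is_F, gender_is_M, gender_is_Other
# (dtype=int forces literal 0/1 values instead of booleans)
pd.get_dummies(gender, prefix='gender_is', dtype=int)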
The 0-1 flags, however, are hard to explore since users often handle the values jointly. This becomes very awkward when there are multiple categorical variables with different numbers of possible values. Given these use cases, a new data type was necessary, since numpy's speed depends greatly on the single-data-type constraint.
pandas is the package that provides a data type, pandas.DataFrame, that can solve this issue!
To introduce the data frame, we'll use the train.csv file from the Titanic dataset on Kaggle.com as an example.
import pandas as pd
df = pd.read_csv("train.csv")
Pandas Data Frame Attributes
A pandas.DataFrame is similar to a numpy.array in that it has 2 dimensions.
- The columns often represent the different features/variables in the data. These are labeled by the column names (called columns), which are often provided by the first row in a CSV file.

  df.columns
  # Index(['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp',
  #        'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked'],
  #       dtype='object')
- The rows often represent different records in the data. These are labeled by something called the index, which defaults to the row number in the CSV file.

  df.index
- To get the size of the data frame, pandas preserves the same shape attribute as numpy.

  df.shape
- We can also access any of the columns (all rows) as an attribute. This will return a pandas.Series.

  surv = df.Survived
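A side note not in the list above: attribute access only works when the column name is a valid Python identifier and does not collide with an existing DataFrame attribute; bracket indexing is the more general form.

surv = df['Survived']  # equivalent to df.Survived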
Subsetting a Data Frame
Similar to numpy, we generally subset data frames by columns and rows. However, since data frames have additional features, they need to be subsetted using particular methods that work with different types of inputs.
pandas.DataFrame.head allows us to look at the first few rows of data.

df.head()
pandas.DataFrame.loc allows us to subset using a boolean Series.

cheap_tickets = df.Fare < 5
cheap_df = df.loc[cheap_tickets, ]
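Multiple boolean conditions can also be combined with & (and) and | (or), each condition wrapped in parentheses; a small sketch on the same data:

# third-class passengers with a fare under 5
cheap_third_class = df.loc[(df.Fare < 5) & (df.Pclass == 3), ]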
pandas.DataFrame.iloc allows us to subset by the row position and column position. Here we will grab the first 3 rows and first 3 columns from the data frame we got above. Notice that the index (179, 263, and 271) retained the values from df instead of restarting at 0 again.

cheap_df.iloc[:3, :3]
#      PassengerId  Survived  Pclass
# 179          180         0       3
# 263          264         0       1
# 271          272         1       3
pandas.DataFrame.loc also allows us to subset using index values and column names.

cheap_df.loc[:263, ['Survived', 'PassengerId']]
#      Survived  PassengerId
# 179         0          180
# 263         0          264
Notice that :263, which usually subsets the first 263 records, now returned all the records up to the index value that corresponded to 263. In the data frame cheap_df, this returned only 2 rows. This shows that the index may not always correspond to the row number.
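To make the contrast concrete, a small sketch on the same cheap_df: loc slices by index label (inclusive of the endpoint), while iloc slices by position (exclusive of the endpoint).

cheap_df.loc[:263, 'Survived']  # every row up to and including label 263
cheap_df.iloc[:2, :]            # the first 2 rows, whatever their labels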
Each column is a pandas.Series
We hinted that each column in a data frame is a pandas.Series. These work similarly to numpy.array in that all values share a data type, and methods exist to calculate summary statistics on them.
df.Fare / 10 # returns a pandas.Series with each value divided by 10
df.Fare.mean() # calculates the mean
df.Age.isna().mean() # calculates the frequency of NaN values
df.Sex.value_counts() # calculates the frequency of each possible value
df.Fare.argmax() # returns the row number with the largest value
df.Age.fillna(df.Age.mean()) # Replaces all the NaN values with the average
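One caution worth adding: fillna() returns a new Series rather than modifying df in place, so to keep the result you would assign it back.

# assumption: we want to overwrite the Age column with the imputed values
df['Age'] = df.Age.fillna(df.Age.mean())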
Many of these methods also apply to the data frame although this is generally discouraged because data frames often have mixed data types.
Creating a Data Frame
There are 2 main ways you’ll get a pandas data frame:
- Reading it from a file using pandas.read_csv()

  import pandas as pd
  df = pd.read_csv("demo.csv")
- Creating it from scratch

  # A dictionary of lists
  df = pd.DataFrame({'a': [1, 2], 'b': [5, 6]})
  # A list of dictionaries
  df1 = pd.DataFrame([{'a': 1, 'b': 5}, {'a': 2, 'b': 6}])
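A data frame can also be built from a numpy.array by supplying the column names yourself (a small sketch, beyond the 2 main ways above):

import numpy as np
df2 = pd.DataFrame(np.array([[1, 5], [2, 6]]), columns=['a', 'b'])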
Merging/Joining 2 Datasets
Sometimes we have information across 2 different datasets but we want to combine them, e.g. combining the geographic information of a city with the health statistics of a city allows us to plot the health statistics on a map for easy comparisons.
This operation is often called a "join", a word from SQL. The records that share a key across the two datasets will be merged. In pandas, the parameter on lets us indicate which columns the datasets will be merged on. To demonstrate the different merges, we will use the following example; notice that city is shared across both datasets, but the values under city do not fully overlap.
a = pd.DataFrame([{'city': 'new york', 'lon': 100},
{'city': 'san francisco', 'lon': 120}])
b = pd.DataFrame([{'hospital': 'A', 'city': 'new york', 'beds': 90},
{'hospital': 'B', 'city': 'new york', 'beds': 20},
{'hospital': 'A', 'city': 'austin', 'beds': 210}])
- Inner join (the default) will only merge records that exist in both datasets. Notice that duplication will happen when there are multiple matches.

  a.merge(b, on='city')
  #        city  lon hospital  beds
  # 0  new york  100        A    90
  # 1  new york  100        B    20
- Outer join will keep all records across both datasets. Missing values will all be filled in with NaN.

  a.merge(b, on='city', how='outer')
  #             city    lon hospital   beds
  # 0       new york  100.0        A   90.0
  # 1       new york  100.0        B   20.0
  # 2  san francisco  120.0      NaN    NaN
  # 3         austin    NaN        A  210.0
- Left join will keep all records on the left, i.e. the data frame that we call merge() on.

  a.merge(b, on='city', how='left')
  #             city  lon hospital  beds
  # 0       new york  100        A  90.0
  # 1       new york  100        B  20.0
  # 2  san francisco  120      NaN   NaN
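For symmetry, pandas also supports a right join (how='right'), which keeps all records in the data frame passed to merge(); the output below is a sketch under the same a and b:

a.merge(b, on='city', how='right')
#        city    lon hospital  beds
# 0  new york  100.0        A    90
# 1  new york  100.0        B    20
# 2    austin    NaN        A   210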
Applying a function across rows or columns
Similar to numpy, we can "apply" the same function across multiple rows or columns without explicitly calling a for-loop.
- Calculating the number of unique values in each column

  df.apply(lambda x: x.unique().shape[0], axis=0)

  Here we used a lambda function, i.e. a function that we pass to pandas.DataFrame.apply() without properly defining it with def. Each column is treated as x, and for each column we call the method pandas.Series.unique() and then obtain the attribute pandas.Series.shape. Notice that apply() is called as a method on the data frame, so the data used is the entire data frame.
- Changing axis=0 to axis=1 will apply the function to each row instead; see the sketch after this list.
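For instance, a row-wise sketch with axis=1, building a hypothetical family-size feature (SibSp and Parch count siblings/spouses and parents/children in the Titanic data):

# each row arrives as a pandas.Series, indexed by the column names
df.apply(lambda row: row['SibSp'] + row['Parch'] + 1, axis=1)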
Grouping
A rather handy feature with pandas is its ability to partition the data frame into groups according to certain features. For example, we can group the data by Sex and Pclass, then perform certain calculations.
df_grp = df.groupby(['Sex', 'Pclass'])
df_grp.Survived.mean()
# Sex Pclass
# female 1 0.968085
# 2 0.921053
# 3 0.500000
# male 1 0.368852
# 2 0.157407
# 3 0.135447
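A side note: the result above is a Series with a MultiIndex (Sex, Pclass); .unstack() pivots the inner level into columns, which can be easier to read (output sketched from the numbers above):

df_grp.Survived.mean().unstack()
# Pclass         1         2         3
# Sex
# female  0.968085  0.921053  0.500000
# male    0.368852  0.157407  0.135447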
For more custom calculations, df_grp.groups is an attribute that is a dictionary containing each group label as the key and the index of all records belonging to the group.
for grp, inds in df_grp.groups.items():
    # Using the groupby object method
    surv_rate = df_grp.get_group(grp).loc[:, 'Survived'].mean()
    # Using the indices
    surv_rate2 = round(df.loc[inds, 'Survived'].mean() * 100, 2)
    print('Survival rate for group with Sex {} and Pclass {} is {}%'.format(
        grp[0], grp[1], round(100 * surv_rate, 2)))
    print(f'Survival rate for group with Sex {grp[0]} and Pclass {grp[1]} is {surv_rate2}%')
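When the calculation is a standard summary, the explicit loop can be avoided: the groupby object supports agg() directly (a sketch, using the same grouping as above):

df_grp.Survived.agg(['mean', 'count'])  # survival rate and group size per (Sex, Pclass)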
Class-specific methods
There are many data-type-specific methods in pandas that help with the readability of your code. Let's use the a and b datasets from the merging example above.
Appending .str to a Series with a str dtype allows you to use many different string methods.
# To see if 'new' or 'san' appears in each city name
m = a.merge(b, on='city', how='outer')
print(m.city.str.lower())
print(m.city.str.lower().str.contains('new|san'))

# notice that `.contains()` is NOT a standard string method:
demo_str = 'hello'
# demo_str.contains('ll')  # raises AttributeError
'll' in demo_str            # the plain-Python equivalent
Appending .dt to a Series with a datetime dtype allows you to use many different datetime methods.
m['date'] = pd.to_datetime(['2020-01-01', '2020-02-01', '2020-03-01', '2020-04-01'])
print(m.date.dt.month)
print(m.date.dt.day_of_week)
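A note on the encodings (a fact about pandas, not shown in the original output): .dt.month returns the month number from 1 to 12, and .dt.day_of_week returns an integer weekday with Monday as 0 and Sunday as 6.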