Wayne's Github Page

A place to learn about statistics

Numpy

Numpy is the package that enables fast matrix computation in Python.

Numpy has two main features: the array data type (numpy.ndarray, usually created with numpy.array()) and a large collection of functions that operate very efficiently on numpy arrays.

Numpy array

You can think of a numpy array as a matrix with rows and columns, where each cell holds a value of a basic data type.

Numpy arrays can be created manually or sourced from a file.
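As a minimal sketch of both approaches, the snippet below builds two small arrays by hand and reads a third one from text. The values in demo1 and demo2 are made up for illustration and may differ from the arrays used elsewhere on this page, and the comma-separated text is a hypothetical stand-in for a real file on disk.

```python
import io

import numpy as np

# Hypothetical example arrays; the values are assumptions for illustration
demo1 = np.array([[1, 2, 3],
                  [4, 5, 6],
                  [0, 8, 9]])
demo2 = np.array([[1.5, 2.0],
                  [3.0, 4.5]])

# np.loadtxt reads numeric text data; StringIO stands in for a file path here
fake_file = io.StringIO("1,2\n3,4\n5,6")
from_file = np.loadtxt(fake_file, delimiter=",")

print(demo1)
print(from_file)
```

With a real file, we would pass its path to np.loadtxt in place of the StringIO object.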

A single numpy array can only hold one type of data

An important feature of numpy arrays is that each array can only hold a single data type, unlike lists/dictionaries/tuples that can hold multiple data types.

This seemingly restrictive choice is what allows operations on numpy arrays to run very fast. To check the data type of an array, we look at its dtype attribute.

demo2.dtype

Another common attribute is the shape, i.e. the number of rows and number of columns of the array.

demo2.shape
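To see the single-type rule in action, here is a small sketch using made-up arrays (none of these names come from the demos above). When we mix types, numpy converts every value to one common type rather than storing them separately.

```python
import numpy as np

ints = np.array([1, 2, 3])
mixed = np.array([1, 2.5, 3])     # the integers are converted to floats
text = np.array([1, "a", 3.0])    # every value is converted to a string
grid = np.zeros((2, 3))           # 2 rows, 3 columns

print(ints.dtype.kind)   # 'i' for integer
print(mixed.dtype)       # float64
print(text.dtype.kind)   # 'U' for unicode string
print(grid.shape)        # (2, 3)
```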

Subsetting numpy arrays

Subsetting numpy arrays also uses square brackets []. However, since we now have rows and columns, we subset both at once, separated by a comma.

In the example below, we subset with integers: the first value indicates the row (index 2, i.e. the 3rd row) and the second value indicates the column (index 1, i.e. the 2nd column).

demo1[2, 1]

Similar to lists, we can slice a segment of values using :.

demo1[1:, 1]
demo1[:1, 0]
demo1[:, 0]
demo1[0, :]
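The subsetting calls above can be tried on a concrete array. The sketch below assumes a 3x3 demo1 with made-up values, so the printed results only apply to this hypothetical array.

```python
import numpy as np

# Hypothetical 3x3 array; the values are assumptions for illustration
demo1 = np.array([[1, 2, 3],
                  [4, 5, 6],
                  [0, 8, 9]])

single = demo1[2, 1]      # 3rd row, 2nd column
tail_col = demo1[1:, 1]   # 2nd column, from the 2nd row onwards
head_col = demo1[:1, 0]   # 1st column, first row only (still an array)
first_col = demo1[:, 0]   # every row of the 1st column
first_row = demo1[0, :]   # every column of the 1st row

print(single)      # 8
print(tail_col)    # [5 8]
print(head_col)    # [1]
print(first_col)   # [1 4 0]
print(first_row)   # [1 2 3]
```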

If we subset with boolean arrays, the boolean array provided for the row must have the same size as the number of rows. The columns work similarly.

# Creates a boolean array
small_col1 = demo1[:, 0] < 2
# Subset the rows using a boolean array
demo1[small_col1, :]
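Here is the same idea as a self-contained sketch, again with a hypothetical demo1. The col_mask array is an added illustration of subsetting columns; it needs one boolean per column.

```python
import numpy as np

# Hypothetical 3x3 array; the values are assumptions for illustration
demo1 = np.array([[1, 2, 3],
                  [4, 5, 6],
                  [0, 8, 9]])

# One boolean per row: True where the first column is below 2
small_col1 = demo1[:, 0] < 2
kept_rows = demo1[small_col1, :]

# One boolean per column: keep the 1st and 3rd columns
col_mask = np.array([True, False, True])
kept_cols = demo1[:, col_mask]

print(small_col1)   # [ True False  True]
print(kept_rows)
print(kept_cols)
```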

Special values

Two important special values in numpy are numpy.inf and numpy.nan.

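The first is the infinite value, which mostly behaves as expected in arithmetic. Operations that have no defined answer, such as subtracting infinity from infinity, produce numpy.nan.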

1 / np.inf
2 + np.inf

np.inf - np.inf
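The results of these expressions can be checked with a short sketch. The variable names and the np.isinf call are added here for illustration.

```python
import numpy as np

shrunk = 1 / np.inf          # dividing by infinity gives zero
grown = 2 + np.inf           # adding to infinity is still infinity
undefined = np.inf - np.inf  # no defined answer, so the result is nan

print(shrunk)      # 0.0
print(grown)       # inf
print(undefined)   # nan

# np.isinf flags both positive and negative infinity
flags = np.isinf(np.array([1.0, np.inf, -np.inf]))
print(flags)       # [False  True  True]
```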

The second is the numpy.nan value, which is commonly used for missing values. This value is unique because any arithmetic involving it produces another numpy.nan value. This is useful because it’ll be clear when there are missing values in the data.

np.nan * 2
np.nan + 1

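nan_demo = [2, np.nan, 5, -3]
# The built-in max/min do not reliably ignore nans; the result depends on the order of the values
max(nan_demo) # 5 here, but nan if np.nan came first in the list
# np.max always propagates nans
np.max(nan_demo) # nan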
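When we do want to skip missing values, numpy has nan-aware counterparts such as np.nanmax and np.nanmean, and np.isnan locates the missing entries. The sketch below reuses the nan_demo values; the reordered list is an added illustration of why the built-in max should not be relied on.

```python
import numpy as np

nan_demo = np.array([2, np.nan, 5, -3])

missing = np.isnan(nan_demo)         # True where a value is missing
largest = np.nanmax(nan_demo)        # ignores the nan
average = np.nanmean(nan_demo)       # mean of 2, 5 and -3
plain_max = np.max(nan_demo)         # propagates the nan

# The built-in max depends on order: with nan first, it returns nan
builtin_max = max([np.nan, 2, 5, -3])

print(missing)       # [False  True False False]
print(largest)       # 5.0
print(plain_max)     # nan
print(builtin_max)   # nan
```

Note that np.nan == np.nan is False, so np.isnan is the way to test for missing values.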

Common methods

We will demonstrate some of the common methods using the demos above.

Numpy functions

Numpy has many mathematical functions built in.
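A few of these functions are sketched below on a made-up array x. Functions like np.sqrt and np.log work on every element at once, while np.sum and np.mean reduce the array to a single number.

```python
import numpy as np

x = np.array([1.0, 4.0, 9.0])

roots = np.sqrt(x)   # element-wise square root
logs = np.log(x)     # element-wise natural logarithm
total = np.sum(x)    # sum of all elements
avg = np.mean(x)     # arithmetic mean

print(roots)   # [1. 2. 3.]
print(total)   # 14.0
```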

Applying functions along a column or row

A common calculation with tabular data is to compute the same summary statistic for each column or row in the data. Rather than writing a loop, numpy.apply_along_axis() provides a concise way to do this.

Below we calculate the sum for each column, i.e. along the rows. To remember which axis is 0, recall that the first value of an array's shape, i.e. the 0th index, corresponds to the rows.

import numpy as np

col_sum = np.apply_along_axis(np.sum, 0, demo)
print(col_sum)

If we change the axis to 1, we obtain the row totals instead.
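Both axes can be compared side by side in a small sketch. The demo array here is hypothetical, and the last two lines show that reductions such as sum also accept an axis argument directly, which gives the same totals.

```python
import numpy as np

# Hypothetical 2x3 array; the values are assumptions for illustration
demo = np.array([[1, 2, 3],
                 [4, 5, 6]])

col_sum = np.apply_along_axis(np.sum, 0, demo)   # one total per column
row_sum = np.apply_along_axis(np.sum, 1, demo)   # one total per row

print(col_sum)   # [5 7 9]
print(row_sum)   # [ 6 15]

# The same totals using the axis argument of sum
col_sum_direct = demo.sum(axis=0)
row_sum_direct = demo.sum(axis=1)
```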

Broadcasting

Broadcasting is how numpy distributes calculations across arrays that can have different shapes. This is often orders of magnitude faster than running loops. A common use case is calculating pairwise distances between the points in two arrays.
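As a rough sketch of the pairwise distance case, assume two made-up arrays of 2D points, a with 2 points and b with 3 points. Adding a length-1 axis to each with np.newaxis lets numpy stretch them against each other, so every point in a is compared with every point in b without a loop.

```python
import numpy as np

# Hypothetical points, one per row; the values are assumptions for illustration
a = np.array([[0.0, 0.0],
              [3.0, 4.0]])    # shape (2, 2)
b = np.array([[0.0, 0.0],
              [6.0, 8.0],
              [3.0, 0.0]])    # shape (3, 2)

# (2, 1, 2) minus (1, 3, 2) broadcasts to (2, 3, 2)
diff = a[:, np.newaxis, :] - b[np.newaxis, :, :]

# Euclidean distance: sum the squared differences over the last axis
dist = np.sqrt((diff ** 2).sum(axis=2))   # shape (2, 3)

print(dist)
# [[ 0. 10.  3.]
#  [ 5.  5.  4.]]
```

Entry dist[i, j] is the distance between the i-th point of a and the j-th point of b.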