Importing Packages
If you haven’t noticed yet, Python’s default functionality is quite minimal.
For example, log(), which is available on most calculators, is not available by default. You need to source the function from a package.
Using built-in packages
Continuing the log() example, there are two ways to get access to log():
- Import the package, then call the function as an attribute of the package.

  import math
  math.log(1)
- Import the function directly.

  from math import log
  log(1)
The first method is useful while you're developing your code, since you may not know upfront all the functions you'll need. The second is common when you only need a specific function from a relatively large package, e.g. sklearn, as shown in the sketch below.
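For instance, here's a minimal sketch of the second style, assuming scikit-learn is installed: we pull in a single estimator rather than the whole package.

from sklearn.linear_model import LinearRegression  # import only the estimator we need

model = LinearRegression()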
Popular functions
collections.Counter is useful for counting the frequency of items, e.g. word counts in an article, and then behaves similarly to a dictionary.

from collections import Counter
cnt = Counter(['a', 'b', 'a'])
cnt['a']
itertools.product is useful for creating all possible combinations between two lists of values. This often turns a double loop into a single loop.

from itertools import product
for i, j in product([1, 2], [4, 5]):
    print(i + j)
itertools.chain.from_iterable is useful for flattening nested lists (or other containers!).

from itertools import chain
x = [[1, 2], [3, 4, 5]]
x_flat = chain.from_iterable(x)
list(x_flat)
Using custom packages
There is no difference in how you access functions from a custom package versus a built-in package; the difference is in how the package is retrieved.
Built-in packages come with the Python installation. Custom packages require an additional step that could create problems when working across projects.
Packages may depend on different versions of the same package, creating a "dependency conflict". This can cause major pain if not handled carefully. For example, installing one package may force you to downgrade another, which may break your old code.
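As a quick, hedged aside (assuming pandas is installed), you can check which version of a package you currently have, which helps when diagnosing such conflicts:

import pandas  # assumes pandas is installed in the current environment

print(pandas.__version__)  # most packages expose their version string this way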
Familiarizing yourself with a package manager will come in handy in the future. conda is popular with the scientific community and provides additional features like virtual environments, which let you keep separate environments with different dependencies.
pip is also quite popular, but you would need to pair it with virtualenv (or the built-in venv module). This route is more programmer friendly.
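A minimal sketch of the pip route, assuming a Unix-like shell (the environment name .venv is arbitrary):

python -m venv .venv         # create an isolated environment in ./.venv
source .venv/bin/activate    # activate it (on Windows: .venv\Scripts\activate)
pip install pandas numpy     # install packages into the active environment only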
Most data science packages can be installed easily through Anaconda.
For those unfamiliar with the command line, the Anaconda Navigator may be a better starting point, but it isn't as well supported given that most data practitioners are reasonably comfortable with the command line.
Recommended general flow
- Keep your base environment relatively simple, i.e. do not install too many packages in your default Python environment.
- For each project, have an isolated environment. Here we show the conda command to create an environment called proj1 with Python version 3.8, including packages like pandas, numpy, etc.

  conda create -n proj1 python=3.8 pandas ipykernel numpy

  This handles all the dependencies between the packages.
- Before working on a project, activate the environment, then launch Python from within it to use the proper packages.

  conda activate proj1
  ipython
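Since ipykernel was included when creating the environment, you can also register it as a Jupyter kernel. A sketch, where the name and display name are arbitrary:

python -m ipykernel install --user --name proj1 --display-name "Python (proj1)"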
Dominant packages in data science
Here’s a list of the popular packages used in data science that will be covered in later chapters.
- numpy for matrix computation
- pandas for basic data wrangling and jointly manipulating multiple data types
- matplotlib, the foundational visualization package
- seaborn is prettier :)
- requests is used to interact with different APIs (e.g. websites)
- beautifulsoup4 is used to parse websites' HTML or XML code
- scikit-learn is where most machine learning and statistical algorithms are stored.
When importing these packages, it's common to shorten their names. Below we shorten numpy to np so the code is shorter.
import numpy as np
demo_array = np.array([1, 2, 3])
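The other packages have widely used aliases that follow the same pattern; a brief sketch:

import pandas as pd              # conventional alias for pandas
import matplotlib.pyplot as plt  # conventional alias for matplotlib's plotting module
import seaborn as sns            # conventional alias for seaborn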