Learning Python from the Data Perspective
In contrast to most programming tutorials, these notes teach Python from a data perspective rather than a computer science perspective. We will use examples from the tech industry whenever possible.
The table below lists each concept, linked to a detailed explanation of that concept.
| Coding concept | Programming example | Use case example | Statistics example |
| --- | --- | --- | --- |
| 1. Interacting with Python and your files | `python hello.py` | Your files and programs are located in different paths (folders) on your computer. Modern computer interfaces abstract this concept away, which can confuse beginning programmers. | |
| 2. Basic syntax - interacting with the REPL | `1 + 1  # works!`<br>`1 +  # fails: SyntaxError` | What to expect from Python after each command | Makes sense: $$\sum_{i=1}^{10} i$$ Doesn't make sense: $$i \sum_{i=1}^{10}$$ |
| 3. Assigning variables | `n = 10`<br>`print(n)`<br>`sum(range(n))` | `n` is used to estimate the cost of sampling, to quantify the uncertainty of our hypothetical sample average, and to calculate the sample average. We want the same value used throughout, so we give it a name: the sample size. | $$n = 10$$ $$\sum_{i=1}^n i = \sum_{j=1}^n j$$ |
| 4. Simple and single-valued data types | `demo_num = 1`<br>`demo_str = "1"`<br>`demo_bool = True` | Data can be numbers or text. We handle the two differently, so data types tell functions how to treat each value. | $$y \in \mathbb{R}, x \in \{0, 1\}$$ |
| 5. Simple and multi-valued data types | `a_list = [1, 2, '3']`<br>`a_dict = {'a': 1, 'b': 2}`<br>`a_tuple = (1, 2, 3)` | We need to differentiate a single value from a collection of data. | $$y = 1, x = \{1, 2, 3\}$$ |
| 6. Calling functions | `min(3, -1)`<br>`sum([1, 2, -1])` | We want to wrap a collection of commands into a single call with defined inputs and outputs. | $$f:\mathbb{R}^p \to \mathbb{R}$$ |
| 7. Control flow - for-loops | `for i in range(3):`<br>`    print(i)` | How do we ask a computer to repeat its tasks? | $$\forall i, f(x_i)$$ |
| 8. Importing packages | `import math`<br>`math.log(1)` | To leverage other people's code, we often import their packages. | |
| 9. if/else and exceptions | `if 'a' in 'Broadway':`<br>`    print('Broadway with an "a"')` | When the code should change its behavior under different conditions | $$f(x)=\left\{ \begin{array}{ll} 0 & x < \theta \\ x & x \geq \theta \end{array} \right.$$ |
| 10. Reading and writing data to files | `import numpy as np`<br>`import pandas as pd`<br>`x = np.loadtxt('demo.csv', delimiter=',')`<br>`y = pd.read_csv('demo.csv')` | Loading data from files and writing data back to them | |
| 11. Numpy | `import numpy as np`<br>`demo = np.array([[1, 2, 3], [4, 5, 6]])`<br>`demo.reshape(-1, 2)` | The foundational numerical package, which offers many features similar to R's vectors. | $$X(X^TX)^{-1}X^T$$ |
| 12. Pandas | `import pandas as pd`<br>`pd.DataFrame([{'sex': 'M', 'score': 2}, {'sex': 'F', 'score': 8}])` | Python needed something like R's data frames that could handle multiple types of data in a tabular format. | |
| 13. Text manipulation and regular expressions | `import re`<br>`re.compile('[A-Z][a-z]+').search('Python')` | How can we express general rules about text, such as proper nouns starting with a capital letter followed by one or more lowercase letters? | |
| 14. Mapping functions instead of looping | `from functools import reduce`<br>`abs_nums = map(abs, [-1, 2, 0])`<br>`tot_mag = reduce(lambda x, y: x + y, abs_nums)` | How can we separate the parallelizable operations from the ones that cannot be parallelized? | |
| 15. Date and time objects | `import datetime`<br>`datetime.datetime.now()` | How can we handle date/time values, given that they follow different conventions from ordinary numbers? For example, there are 60 seconds in 1 minute. | |
| 16. Data visualization with seaborn | `import matplotlib.pyplot as plt`<br>`import numpy as np`<br>`import seaborn as sns`<br>`x = np.linspace(-1, 1, 100)`<br>`y = (0.1 * x - 0.5 * np.power(x, 2) + np.random.normal(size=len(x), scale=0.03))`<br>`sns.relplot(x=x, y=y)`<br>`plt.show()` | A picture is worth a thousand words. | |
| 17. Interacting with APIs | | To interact with machines, especially online, there is a standard protocol for giving and getting data. | |
| 18. sklearn and fitting models | `from sklearn.linear_model import LinearRegression`<br>`ols = LinearRegression().fit(X, Y)`<br>`pred = ols.predict(X2)` | Models are mathematical instruments that help scientists understand patterns in the data, understand how the world works, or simply predict future outcomes. These models are tuned and validated using data. This module covers fitting different models to data using `sklearn`. | $$Y = f(X, \beta) + \epsilon$$ $$\hat{\beta} = \arg\min_\beta \mathrm{Loss}(Y, \hat{Y}(\beta))$$ |
| 19. Basic optimization | `from scipy.optimize import minimize`<br>`min_out = minimize(obj, x0)  # x0: initial guess` | To fit a model to data, we need to define what a good model (or bad model) looks like. Finding the least bad or best model is an optimization exercise. | |
| 20. SQL | `SELECT COUNT(DISTINCT customer_id) AS uniq_users FROM orders` | Data often sits in a database, and SQL is one of the most popular languages for querying databases. Python 3 has a built-in library, `sqlite3`, that lets us interface with SQLite. | |
| 21. Random functions | `import random`<br>`random.gauss(0, 1)`<br>`random.choice(['heads', 'tails'])` | To sample from a large list of items, we need a way to generate pseudo-random numbers. | $$Y \sim F$$ |
| 22. Monte Carlo | `import numpy as np`<br>`maxs = []`<br>`for _ in range(1000):`<br>`    y = np.random.normal(size=5)`<br>`    maxs.append(np.max(y))`<br>`np.mean(maxs)` | Certain probabilistic quantities are well defined but have no closed-form solution, or their exact values are not computationally feasible to calculate. In those cases we can approximate them using simulations. | $$E(Y) \approx \frac{1}{n}\sum_i Y_i$$ |
| 23. Pseudo-code | `# For every job description`<br>`# - check whether it sponsors H1B`<br>`# - scrape its description` | Before you write code, you want to plan it out. Writing pseudo-code is a good way to start this practice. | |
| 24. Wrangling | `import pandas as pd`<br>`health = pd.DataFrame([{'id': 1, 'height': 180, 'weight': 80}, {'id': 2, 'height': 150}])`<br>`pd.melt(health, id_vars='id', value_vars=['height', 'weight'])` | Storing data requires flexibility while analyzing data requires structure, e.g. every resume should list a GPA. Translating between the two forms to facilitate different tasks is called data wrangling. | |
| 25. Debugging and making a minimum reproducible example | `# There's an error somewhere below!`<br>`conditions = [True, False]`<br>`if conditions:`<br>`    print('We have a True')`<br>`if np.array(conditions):`<br>`    print('We detected another True')` | Everyone makes mistakes when coding. What are the best practices for resolving bugs? | |
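
Concept 20 mentions that Python 3 can talk to SQLite through a built-in library. The sketch below is one minimal way to try the query from that row using the standard `sqlite3` module. The in-memory database, the `order_id` column, and the four rows are made up for illustration; only the `orders` table name, `customer_id`, and `uniq_users` come from the table above.

```python
import sqlite3

# Hypothetical in-memory database; the schema and rows are invented for this demo.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id INTEGER, customer_id INTEGER)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [(1, 101), (2, 102), (3, 101), (4, 103)],
)

# The query from concept 20: count each customer once,
# no matter how many orders they placed.
query = "SELECT COUNT(DISTINCT customer_id) AS uniq_users FROM orders"
uniq_users = conn.execute(query).fetchone()[0]
print(uniq_users)  # 3

conn.close()
```

Customer 101 placed two orders but is counted once, which is why the result is 3 rather than 4.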
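
Concept 22 approximates an expectation with a simulated average, $E(Y) \approx \frac{1}{n}\sum_i Y_i$. The sketch below repeats that idea with only the standard library, so it runs without NumPy. The function name `simulate_mean_max`, the seed, and the number of simulations are illustrative choices, not part of the original notes.

```python
import random

random.seed(0)  # arbitrary fixed seed so reruns give the same estimate


def simulate_mean_max(n_draws=5, n_sims=10_000):
    """Approximate E[max of n_draws standard normals] by averaging simulated maxima."""
    maxs = []
    for _ in range(n_sims):
        # One simulated sample of n_draws standard normal values.
        y = [random.gauss(0, 1) for _ in range(n_draws)]
        maxs.append(max(y))
    # The average of the simulated maxima plays the role of (1/n) * sum_i Y_i.
    return sum(maxs) / len(maxs)


estimate = simulate_mean_max()
print(round(estimate, 2))
```

Raising `n_sims` should make the estimate more stable from run to run, at the cost of more computation.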
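
Concept 15 points out that date/time values follow their own conventions, such as 60 seconds in a minute. As a small illustration, the sketch below adds 45 seconds to a timestamp just before midnight and lets the standard `datetime` module carry the overflow into minutes, hours, and the next month. The timestamp itself is a made-up example.

```python
from datetime import datetime, timedelta

# Hypothetical timestamp: 30 seconds before midnight on January 31.
start = datetime(2024, 1, 31, 23, 59, 30)

# Adding 45 seconds rolls over the minute, the hour, the day, and the month.
later = start + timedelta(seconds=45)
print(later)  # 2024-02-01 00:00:15

# Subtracting two datetimes gives a timedelta, which can be read back in seconds.
elapsed = later - start
print(elapsed.total_seconds())  # 45.0
```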