Learning Python from the Data Perspective

In contrast to most programming tutorials, these notes teach Python from a data perspective rather than a computer science perspective. Examples come from the tech industry whenever possible.

The table below lists each concept, linked to a detailed explanation.

Coding concept	Programming example	Use case example	Statistics example
1. Interacting with Python and your files	python hello.py	Your files and programs are located in different folders (paths) on your computer. Modern computer interfaces abstract this concept away, which can confuse beginning programmers.
2. Basic syntax - interacting with the REPL	1 + 1 # works! 1 + 1 # fails	What to expect from Python after each command	Makes sense: $$\sum_{i=1}^{10} i$$ Doesn't make sense: $$i \sum_{i=1}^{10}$$
3. Assigning variables	n = 10; print(n); sum(range(n))	The sample size n is used to estimate the cost of sampling, to quantify the uncertainty of our hypothetical sample average, and to calculate the sample average itself. We want the same value used throughout, so we give it a name.	$$n = 10$$ $$\sum_{i=1}^n i = \sum_{j=1}^n j$$
4. Simple and single-valued data types	demo_num = 1; demo_str = "1"; demo_bool = True	Data can be numbers or text. Because we handle them differently, data types tell functions how to treat each value.	$$y \in \mathbb{R}, x \in \{0, 1\}$$
5. Simple and multi-valued data types	a_list = [1, 2, '3']; a_dict = {'a': 1, 'b': 2}; a_tuple = (1, 2, 3)	We need to differentiate a single value from a collection of data.	$$y = 1, x = \{1, 2, 3\}$$
6. Calling functions	min(3, -1); sum([1, 2, -1])	We want to wrap up a collection of commands into a single call with certain inputs and outputs.	$$f:\mathbb{R}^p \to \mathbb{R}$$
7. Control flow - for-loops	for i in range(3): print(i)	How do we ask a computer to repeat its tasks?	$$\forall i, f(x_i)$$
8. Importing packages	import math; math.log(1)	To leverage other people's code, we import their packages.
9. if/else and exceptions	if 'a' in 'Broadway': print('Broadway with an "a"')	How do we make code change its behavior under different conditions?	$$f(x)=\left\{ \begin{array}{ll} 0 & x < \theta \\ x & x \geq \theta \end{array} \right.$$
10. Reading and writing data to files	import numpy as np; import pandas as pd; x = np.loadtxt('demo.csv', delimiter=','); y = pd.read_csv('demo.csv')	Loading data from files and writing data back to files
11. NumPy	import numpy as np; demo = np.array([[1, 2, 3], [4, 5, 6]]); demo.reshape(-1, 2)	The foundational mathematics package, which offers many features similar to R's vectors.	$$X(X^TX)^{-1}X^T$$
12. Pandas	import pandas as pd; pd.DataFrame([{'sex': 'M', 'score': 2}, {'sex': 'F', 'score': 8}])	Python needed something like R's data frames that could handle multiple types of data in a tabular format.
13. Text manipulation and regular expressions	import re; re.compile('[A-Z][a-z]+').search('Python')	How can we express general rules about text, such as proper nouns starting with a capital letter followed by one or more lowercase letters?
14. Mapping functions instead of looping	from functools import reduce; abs_nums = map(abs, [-1, 2, 0]); tot_mag = reduce(lambda x, y: x + y, abs_nums)	How can we separate the parallelizable operations from the ones that cannot be parallelized?
15. Date and time objects	import datetime; datetime.datetime.now()	How can we handle date/time values, given that they follow different conventions from ordinary numbers? For example, there are 60 seconds in 1 minute.
16. Data visualization with seaborn	import matplotlib.pyplot as plt; import numpy as np; import seaborn as sns; x = np.linspace(-1, 1, 100); y = (0.1 * x - 0.5 * np.power(x, 2) + np.random.normal(size=len(x), scale=0.03)); sns.relplot(x=x, y=y); plt.show()	A picture is worth a thousand words.
17. Interacting with APIs		To interact with machines, especially online, there are standard protocols for sending and receiving data.
18. sklearn and fitting models	from sklearn.linear_model import LinearRegression; ols = LinearRegression().fit(X, Y); pred = ols.predict(X2)	Models are mathematical instruments that can help scientists understand patterns in the data, understand how the world works, or simply predict future outcomes. These models are tuned and validated using data. This module covers fitting different models to data using `sklearn`.	$$Y = f(X, \beta) + \epsilon$$ $$\hat{\beta} = \arg\min_\beta \text{Loss}(Y, \hat{Y}(\beta))$$
19. Basic optimization	from scipy.optimize import minimize; min_out = minimize(obj, x0) # x0: initial guess, required by minimize	To fit a model to data, we need to define what a good (or bad) model looks like. Finding the best or least bad model is an optimization exercise.
20. SQL	SELECT COUNT(DISTINCT(customer_id)) AS uniq_users FROM orders	Data often sits in a database, and SQL is one of the most popular languages for querying it. Python 3 has a built-in library, `sqlite3`, that allows us to interface with SQLite.
21. Random functions	import random; random.gauss(0, 1); random.choice(['heads', 'tails'])	To sample from a large list of items, we need something that can generate pseudo-random numbers.	$$Y \sim F$$
22. Monte Carlo	import numpy as np maxs = [] for _ in range(1000): y = np.random.normal(size=5) maxs.append(np.max(y)) np.mean(maxs)	Certain probabilistic quantities are well defined but either have no closed-form solution or are computationally infeasible to calculate exactly. In these cases we can approximate them using simulations.	$$E(Y) \approx \frac{1}{n}\sum_i Y_i$$
23. Pseudo-code	# For every job description # - check whether it sponsors H1B # - scrape its description	Before you write code, you want to plan it out. Writing pseudo-code is a good way to start this practice.
24. Wrangling	import pandas as pd; health = pd.DataFrame( [{'id': 1, 'height': 180, 'weight': 80}, {'id': 2, 'height': 150}]); pd.melt(health, id_vars='id', value_vars=['height', 'weight'])	Storing data requires flexibility, while analyzing data requires structure, e.g. every resume should list a GPA. Translating between the two to facilitate different tasks is called data wrangling.
25. Debugging and making a minimum reproducible example	# There's an error somewhere below! conditions = [True, False] if conditions: print('We have a True') if np.array(conditions): print('We detected another True')	Everyone makes mistakes when coding. What are the best practices for resolving bugs?
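
The debugging example in row 25 is flattened onto one line in the table, so here is a minimal sketch of how it might look as a runnable script. It assumes NumPy is installed and imported as `np` (the table cell does not show the import), and it shows one error a reader is likely to hit: NumPy refuses to treat a multi-element array as a single True/False value. The names `error_message` and `has_true` are added here for illustration only. The fix shown is one possible choice, not necessarily the one the detailed notes use.

```python
import numpy as np  # assumption: the table cell omits this import

conditions = [True, False]

# A non-empty list is always truthy, whatever it contains, so this branch
# would also run for [False, False].
if conditions:
    print('The list is non-empty')

# Minimal reproducible example: the truth value of a multi-element array
# is ambiguous, so NumPy raises a ValueError here.
error_message = None
try:
    if np.array(conditions):
        print('We detected another True')
except ValueError as err:
    error_message = str(err)
    print('ValueError:', error_message)

# One possible fix: state explicitly that any single True is enough.
has_true = bool(np.array(conditions).any())
if has_true:
    print('We detected another True')
```

Shrinking the problem to the two lines inside the `try` block is what makes the example "minimum": everything unrelated to the error has been removed.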

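The Monte Carlo example in row 22 can also be written out as a short script. The sketch below swaps NumPy for the standard library's `random.gauss` from row 21 so that it runs without extra packages. The seed and the names `n_sims`, `sample_size`, and `estimate` are assumptions added for illustration. It approximates the expected maximum of 5 standard normal draws by averaging the maximum over 1000 simulated samples, which is the approximation $$E(Y) \approx \frac{1}{n}\sum_i Y_i$$ from the table.

```python
import random
import statistics

random.seed(0)  # assumed seed, only so that reruns give the same numbers

n_sims = 1000     # number of simulated samples
sample_size = 5   # draws per sample, as in row 22

maxs = []
for _ in range(n_sims):
    # Draw one sample of standard normal values and keep its maximum.
    y = [random.gauss(0, 1) for _ in range(sample_size)]
    maxs.append(max(y))

# The average of the simulated maxima approximates the expected maximum.
estimate = statistics.mean(maxs)
print(estimate)
```

Increasing `n_sims` should make the estimate more stable from run to run, at the cost of more computation.
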
Wayne's Github Page
