Wayne's Github Page

A place to learn about statistics

Learning Python from the Data Perspective

In contrast to most programming tutorials, these notes teach Python from a data perspective rather than a computer science perspective. We will use examples from the tech industry whenever possible.

The table below lists each concept and links it to a detailed explanation.

Coding concept | Programming example | Use case example | Statistics example
1. Interacting with Python and your files
  python hello.py
  
Your files and programs are located on different "paths/folders" on your computer. Modern computer interfaces abstract this concept away which can confuse beginning programmers.
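As a small illustration of paths, the sketch below asks Python which folder it is currently running in and builds the path to a file inside that folder. It uses the standard-library `pathlib` module; the name `hello.py` is taken from the example above, and the file is not assumed to exist.

```python
from pathlib import Path

# The folder Python is currently running in (the "working directory").
cwd = Path.cwd()

# Build the path of a file inside that folder; this does not create the file.
script = cwd / "hello.py"

print(cwd)              # location depends on your machine
print(script.name)      # the file name without its folders
print(script.exists())  # True only if hello.py is really in this folder
```

If `python hello.py` reports that the file cannot be found, comparing the printed working directory with the folder that holds the file is a reasonable first check.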
2. Basic syntax - interacting with the REPL
  1 + 1  # works!
   1 + 1 # fails: unexpected indent (note the extra leading space)
  
What to expect from Python after each command. Makes sense: $$\sum_{i=1}^{10} i$$ Doesn't make sense: $$i \sum_{i=1}^{10}$$
3. Assigning variables
  n = 10
  print(n)
  sum(range(n))
  
The sample size n is used to estimate the cost of sampling, to quantify the uncertainty of our hypothetical sample average, and to calculate the sample average itself. We want the same value used throughout, so we give it a name: the sample size. $$n = 10$$ $$\sum_{i=1}^n i = \sum_{j=1}^n j$$
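One detail worth checking by hand: `range(n)` counts from 0 to n - 1, so `sum(range(n))` is not the same quantity as a sum that runs from 1 to n. A minimal sketch of the difference, where `total_from_zero` and `total_from_one` are names chosen here for illustration:

```python
n = 10

# range(n) yields 0, 1, ..., n - 1
total_from_zero = sum(range(n))

# range(1, n + 1) yields 1, 2, ..., n, matching a sum from i = 1 to n
total_from_one = sum(range(1, n + 1))

print(total_from_zero)  # 45
print(total_from_one)   # 55
```

Because both lines reuse the variable `n`, changing the sample size in one place updates every calculation that depends on it.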
4. Simple and single-valued data types
  demo_num = 1
  demo_str = "1"
  demo_bool = True
  
Data can be numbers or text. We handle them differently, so we have data types that let functions treat them differently too. $$y \in \mathbb{R}, x \in \{0, 1\}$$
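To see why the type matters, the sketch below applies the same `+` operator to a number and to text. The variable names `num_sum`, `str_sum`, and `converted` are introduced here for illustration only.

```python
demo_num = 1
demo_str = "1"
demo_bool = True

# type() reports how Python is storing each value.
print(type(demo_num))   # <class 'int'>
print(type(demo_str))   # <class 'str'>
print(type(demo_bool))  # <class 'bool'>

# The same operator behaves differently depending on the type.
num_sum = demo_num + demo_num  # arithmetic addition
str_sum = demo_str + demo_str  # text concatenation

print(num_sum)  # 2
print(str_sum)  # 11

# Converting between types is explicit.
converted = int(demo_str) + demo_num
print(converted)  # 2
```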
5. Simple and multi-valued data types
  a_list = [1, 2, '3']
  a_dict = {'a': 1,
            'b': 2}
  a_tuple = (1, 2, 3)
  
We need to differentiate a single value from a collection of data. $$y = 1, x = \{1, 2, 3\}$$
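Once values are grouped into a collection, the next step is usually pulling individual values back out. A minimal sketch, with `first_item`, `last_item`, and `b_value` as illustrative names:

```python
a_list = [1, 2, '3']
a_dict = {'a': 1, 'b': 2}
a_tuple = (1, 2, 3)

# Lists and tuples are indexed by position, starting at 0.
first_item = a_list[0]
last_item = a_tuple[-1]

# Dictionaries are indexed by key.
b_value = a_dict['b']

# Lists can be changed in place; tuples cannot.
a_list.append(4)

print(first_item, last_item, b_value)  # 1 3 2
print(len(a_list))                     # 4
```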
6. Calling functions
  min(3, -1)
  sum([1, 2, -1])
  
We want to wrap up a collection of commands into a single call with certain inputs and outputs. $$f:\mathbb{R}^p \to \mathbb{R}$$
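The calls above use built-in functions. A minimal sketch of wrapping a few commands into your own function follows; `sample_mean` is a name invented for this example, and it takes several numbers as input and returns a single number.

```python
def sample_mean(values):
    """Return the arithmetic mean of a non-empty list of numbers."""
    return sum(values) / len(values)


result = sample_mean([1, 2, -1, 2])
print(result)  # 1.0
```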
7. Control flow - for-loops
  for i in range(3):
    print(i)
  
How do we ask a computer to repeat its tasks? $$\forall i, f(x_i)$$
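The loop above only prints a counter. A common variation, sketched here, applies the same function to every element of a collection and stores the results; `square` stands in for a generic function f and is purely illustrative.

```python
def square(x):
    """Hypothetical f: return x squared."""
    return x * x


xs = [1, 2, 3]
results = []

# Repeat the same task for every element x_i.
for x in xs:
    results.append(square(x))

print(results)  # [1, 4, 9]
```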
8. Importing packages
  import math
  math.log(1)
  
To leverage other people's code, we often import their packages.
9. if/else and exceptions
  if 'a' in 'Broadway':
      print('Broadway with an "a"')
  
Code often needs to behave differently under different conditions. $$f(x)=\left\{ \begin{array}{ll} 0 & x < \theta \\ x & x \geq \theta \end{array} \right.$$
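The thresholding function in the statistics example can be sketched directly with if/else, and a short try/except shows the exception half of this topic. The function name `threshold` and the input values are assumptions made for illustration.

```python
def threshold(x, theta):
    """Return 0 when x is below theta, otherwise return x."""
    if x < theta:
        return 0
    else:
        return x


print(threshold(1, theta=2))  # 0
print(threshold(3, theta=2))  # 3

# Exceptions: handle an error instead of letting the program stop.
try:
    value = int("not a number")
except ValueError:
    value = None

print(value)  # None
```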
10. Reading and writing data to files
  import numpy as np
  import pandas as pd
  x = np.loadtxt('demo.csv', delimiter=',')
  y = pd.read_csv('demo.csv')
  
Loading data from files and writing data back to files.
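The example above only reads. A minimal round-trip sketch using just the standard-library `csv` module is shown below; the rows are made up, and the file is written to a temporary folder so that no existing `demo.csv` is touched.

```python
import csv
import os
import tempfile

rows = [["id", "score"], ["1", "2"], ["2", "8"]]

# Write to a temporary folder so no existing file is overwritten.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "demo.csv")

    # Write the rows out as comma-separated text.
    with open(path, "w", newline="") as f:
        csv.writer(f).writerows(rows)

    # Read the file back in; every value comes back as a string.
    with open(path, newline="") as f:
        loaded = list(csv.reader(f))

print(loaded == rows)  # True
```

Note that plain CSV stores no type information, which is one reason numbers read from a file often need converting before analysis.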
11. Numpy
  import numpy as np
  demo = np.array([[1, 2, 3],
                   [4, 5, 6]])
  demo.reshape(-1, 2)
  
The foundational numerical package, offering many features similar to R's vectors. $$X(X^TX)^{-1}X^T$$
12. Pandas
  import pandas as pd
  pd.DataFrame([{'sex': 'M', 'score': 2},
                {'sex': 'F', 'score': 8}])
  
Python needed something like R's data frames that could handle multiple types of data in a tabular format.
13. Text manipulation and regular expressions
  import re
  re.compile('[A-Z][a-z]+').search('Python')
  
How can we express general rules about text, such as proper nouns starting with a capital letter followed by one or more lowercase letters?
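The same pattern can be applied to a whole sentence with `findall`, which returns every match rather than only the first. The sentence below is invented for illustration.

```python
import re

# One capital letter followed by one or more lowercase letters.
pattern = re.compile(r'[A-Z][a-z]+')

text = 'Wayne teaches Python in New York'
matches = pattern.findall(text)

print(matches)  # ['Wayne', 'Python', 'New', 'York']
```

The rule is only an approximation of "proper noun": it would also match any ordinary word that happens to be capitalized at the start of a sentence.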
14. Mapping functions instead of looping
  from functools import reduce
  abs_nums = map(abs, [-1, 2, 0])
  tot_mag = reduce(lambda x, y: x + y, abs_nums)
  
How can we separate the parallelizable operations from the ones that cannot be parallelized?
15. Date and Time Objects
  import datetime
  datetime.datetime.now()
  
How can we handle date/time values, given that they follow different conventions from ordinary numbers? For example, there are 60 seconds in 1 minute.
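A short sketch of date arithmetic with `datetime.timedelta` shows these conventions being handled automatically. The starting timestamp is arbitrary and chosen only so that several units roll over at once.

```python
import datetime

start = datetime.datetime(2020, 1, 31, 23, 59, 30)

# Adding 45 seconds rolls over the minute, hour, day, and month.
later = start + datetime.timedelta(seconds=45)
print(later)  # 2020-02-01 00:00:15

# Subtracting two datetimes gives a timedelta.
elapsed = later - start
print(elapsed.total_seconds())  # 45.0
```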
16. Data visualization with seaborn
  import matplotlib.pyplot as plt
  import numpy as np
  import seaborn as sns
  x = np.linspace(-1, 1, 100)
  y = (0.1 * x - 0.5 * np.power(x, 2)
       + np.random.normal(size=len(x),
                          scale=0.03))
  sns.relplot(x=x, y=y)  # recent seaborn versions require keyword arguments
  plt.show()
  
A picture is worth a thousand words.
17. Interacting with APIs
  
To interact with other machines, especially online, there are standard protocols for sending and receiving data.
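As a rough sketch of the two halves of an API call, the block below builds a request URL and then parses a JSON reply, without contacting any server. The endpoint `https://api.example.com/v1/jobs`, its parameters, and the response text are all hypothetical and exist only for illustration; a real API documents its own URLs and fields.

```python
import json
from urllib.parse import urlencode

# Hypothetical endpoint and parameters, used only for illustration.
base_url = "https://api.example.com/v1/jobs"
params = {"title": "data scientist", "limit": 2}

# A request is usually a URL plus encoded query parameters.
request_url = base_url + "?" + urlencode(params)
print(request_url)
# https://api.example.com/v1/jobs?title=data+scientist&limit=2

# Made-up response body; a real API would return text in a similar shape.
response_text = '{"count": 2, "jobs": [{"id": 1}, {"id": 2}]}'

# json.loads turns the text into Python dictionaries and lists.
response = json.loads(response_text)
job_ids = [job["id"] for job in response["jobs"]]
print(job_ids)  # [1, 2]
```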
18. sklearn and fitting models
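  import numpy as np
  from sklearn.linear_model import LinearRegression
  X, Y = np.array([[0.0], [1.0], [2.0]]), np.array([1.0, 3.0, 5.0])  # toy training data
  X2 = np.array([[3.0]])  # toy new inputs to predict
  ols = LinearRegression().fit(X, Y)
  pred = ols.predict(X2)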
  
Models are mathematical instruments that can help scientists understand patterns in the data, understand how the world works, or simply predict future outcomes. These models are tuned and validated using data. This module covers fitting different models to data using `sklearn`. $$Y = f(X, \beta) + \epsilon$$ $$\hat{\beta} = \arg\min_\beta \mathrm{Loss}(Y, \hat{Y}(\beta))$$
19. Basic Optimization
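  from scipy.optimize import minimize

  def obj(x):  # illustrative objective, smallest at x = 2
      return (x[0] - 2) ** 2
  min_out = minimize(obj, x0=[0.0])  # x0 is the required starting guess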
  
To fit a model to data, we need to define what a good model (or bad model) looks like. Finding the least bad or best model is an optimization exercise.
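To make "least bad" concrete, the sketch below defines badness as the sum of squared errors of a constant prediction and minimizes it with plain gradient descent written in pure Python. The data `ys`, the step size, and the iteration count are assumptions picked for illustration, and this hand-rolled loop is a teaching device rather than a replacement for a library optimizer.

```python
ys = [1.0, 2.0, 6.0]


def loss(b):
    """Sum of squared errors when every y is predicted by the constant b."""
    return sum((y - b) ** 2 for y in ys)


def gradient(b):
    """Derivative of the loss with respect to b."""
    return sum(-2 * (y - b) for y in ys)


# Plain gradient descent with an assumed step size and iteration count.
b = 0.0
for _ in range(200):
    b = b - 0.1 * gradient(b)

print(round(b, 6))  # 3.0, the sample mean of ys
```

For this particular loss the best constant is the sample mean, which gives a quick way to check that the optimizer landed in the right place.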
20. SQL
  SELECT
    COUNT(DISTINCT(customer_id)) AS uniq_users
  FROM
    orders
  
Data often sits in a database, and SQL is one of the most popular languages for querying it. Python 3 ships with a built-in library, `sqlite3`, that lets us interface with SQLite.
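A minimal sketch of running the query above from Python with the standard-library `sqlite3` module follows. The table layout and the three rows are made up for illustration, and the database lives in memory so nothing is written to disk.

```python
import sqlite3

# An in-memory database, so nothing is written to disk.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id INTEGER, customer_id INTEGER)")

# Made-up rows: customer 1 placed two orders.
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [(1, 1), (2, 1), (3, 2)],
)

query = """
SELECT
  COUNT(DISTINCT customer_id) AS uniq_users
FROM
  orders
"""
uniq_users = conn.execute(query).fetchone()[0]
conn.close()

print(uniq_users)  # 2
```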
21. Random functions
  import random
  random.gauss(0, 1)
  random.choice(['heads', 'tails'])
  
To sample from a large list of items, we need something that can generate pseudo-randomness. $$Y \sim F$$
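Because the randomness is only pseudo-random, fixing the seed makes results repeatable, which helps when sharing or debugging an analysis. A small sketch, with the seed value and list contents chosen arbitrarily:

```python
import random

# Seeding makes the pseudo-random sequence repeatable.
random.seed(0)
first_draw = random.gauss(0, 1)

random.seed(0)
second_draw = random.gauss(0, 1)

print(first_draw == second_draw)  # True

# Sample 3 distinct items from a larger list without replacement.
items = list(range(100))
picked = random.sample(items, 3)
print(len(picked))  # 3
```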
22. Monte Carlo
  import numpy as np
  maxs = []
  for _ in range(1000):
      y = np.random.normal(size=5)
      maxs.append(np.max(y))
  np.mean(maxs)
  
Certain probabilistic quantities are well defined but have no closed-form solution, or their exact value is not computationally feasible to calculate. In these cases we can approximate them using simulations. $$E(Y) \approx \frac{1}{n}\sum_i Y_i$$
23. Pseudo-code
  # For every job description
  #   - check whether it sponsors H1B 
  #   - scrape its description
  
Before you write code, you should plan it out. Writing pseudo-code is a good way to start this practice.
24. Wrangling
  import pandas as pd
  health = pd.DataFrame(
      [{'id': 1, 'height': 180, 'weight': 80},
       {'id': 2, 'height': 150}])
  pd.melt(health, id_vars='id', value_vars=['height', 'weight'])
  
Storing data requires flexibility, while analyzing data requires structure, e.g., every resume should have a GPA field. Translating between the two to facilitate different tasks is called data wrangling.
25. Debugging and making a minimal reproducible example
  # There's an error somewhere below!
  conditions = [True, False]
  if conditions:
     print('We have a True')
  if np.array(conditions):
     print('We detected another True')
  
Everyone makes mistakes when coding. What are best practices to resolve bugs?