Applied Data Mining
This course will expose students to a variety of data mining applications using machine learning methods.
Students who finish this class should:
- Gain an intuitive understanding of basic machine learning methods
- Understand how fitting models can help explore patterns in data
- Understand how to assess models and clustering in different use cases
Prerequisites
- An introductory statistics class
  - Basic probability distributions (e.g. Gaussian, binomial distributions and their likelihoods)
  - Basic hypothesis testing (e.g. t-test)
  - Summary statistics
  - Histograms, boxplots, etc.
- A computing course involving data wrangling and visualization
- A modeling course that estimated parameters from data
Textbooks / References
- An Introduction to Statistical Learning with Applications in R by Gareth James, Daniela Witten, Trevor Hastie, and Robert Tibshirani
- To collaborate on coding projects, here’s a bare minimum GitHub tutorial. If you ever work professionally with code, you should also look into branches and code reviews, which are not covered in the tutorial.
Timeline
I reserve the right to change the ordering and the content for the course throughout the semester.
Date | Topic | Reference | Due |
---|---|---|---|
2022-01-18 | Intro to data mining | Brazilian e-commerce on Kaggle + ISL Chapter 2.2 | |
2022-01-20 | Data mining with basic statistics and regression review | ISL Chapter 3 | Have RStudio installed; informal exploration with the Brazilian e-commerce dataset |
2022-01-25 | Regression review continued | ISL Chapter 3 | |
2022-01-27 | Logistic regression + Naive Bayes | | Homework 1 Due |
2022-02-01 | Principal Component Analysis | ISL 6.3.1 + ISL 10.2 | |
2022-02-03 | Principal Component Analysis Applications | ISL 6.3.1 + ISL 10.2 | |
2022-02-08 | Rise of machine learning and “wrong” models - some history | Paper on Why Biased Estimators given Stein Estimator + Gauss Markov Theorem + ISL Chapter 2.2 continued | |
2022-02-10 | Ridge + Lasso Regression | ISL 6.2 | Homework 2 Due |
2022-02-15 | Ridge + Lasso Simulations | ISL 6.2 | |
2022-02-17 | Tree Methods | ISL 8.1 | |
2022-02-22 | Tree Methods continued | ISL 8.1 | |
2022-02-24 | Trees + forests with real data | ISL 8.2 + bias in random forest variable importance | Homework 3 |
2022-03-01 | Optimization and objective functions | Slide 5 + ISL Chapter 3.1.1 + 3.3.3 | |
2022-03-03 | Beyond classification accuracy | Slides 6 | |
2022-03-08 | Resampling techniques - accuracy vs robustness | Slides 7 + Resampling from ISL | Read paper on Stability |
2022-03-10 | Automated Model Selection | Slides 7 + caret library + ISL on resampling | Project 1 Due |
2022-03-15 | Spring Break | ||
2022-03-17 | Spring Break | ||
2022-03-22 | Clustering - K-means | ISL 10.2 | |
2022-03-24 | Clustering - K-means continued | ISL 10.2 | |
2022-03-29 | K-means with real data | ISL 10.2 | |
2022-03-31 | Hierarchical clustering | ISL 10.2 | Homework 4 |
2022-04-05 | Hierarchical clustering with real data | ISL 10.2 | |
2022-04-07 | DBSCAN | DBSCAN from KDnuggets | |
2022-04-12 | Feature engineering with text | Pre-processing Text + Speech and Language Chapter 6.5 | |
2022-04-14 | Working with text data continued | ||
2022-04-19 | Independent Component Analysis | Stanford ICA Slides | |
2022-04-21 | Models on text including Wordfish | | Homework 5 due 4/22/2022 |
2022-04-26 | Going over final projects in class | ||
2022-04-28 | Going over final projects in class + what we didn’t teach | | Project 2 Due on ~5/2/2022~ 5/7/2022 |
Logistics
Lectures: TuTh 2:40pm - 3:55pm, 503 Hamilton Hall
Teaching Team
Wayne Tai Lee (wtl2109)
- OH: TuTh 1-2pm
- Zoom
- Room: Watson Hall (W 115th St) room 714
Andrew Davison (ad3395)
- OH: MW 1-2pm
- Zoom
- Room: TBD
Online Discussion
The TA and grader will check the online discussion for 30 minutes each weekday. Do not expect an immediate response, so please start your work early and post your questions clearly.
Grading
If your final grade is in [93, 100], you will earn at least an A; [90, 93) will earn at least an A-; [87, 90) will earn at least a B+; etc. A grading curve may be applied depending on class performance, but I will not curve downwards. I may not give out A+’s in this class.
- Homeworks (30%)
  - Late homeworks will receive 0 credit.
  - Homeworks will receive 0 credit unless the code + write-up is submitted both as the source file (.ipynb/.Rmd) AND as the knitted PDF or HTML.
- Projects (60%)
  - Late projects will be penalized by 50% for each day they are late.
  - Projects should be submitted on Canvas.
- Participation (10%)
  - Participation is graded through in-class activities recorded on Canvas, not through attendance.
  - If you surpass 50% here, you’ll receive full credit for participation; otherwise you’ll receive 0 credit.
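As a rough illustration of how the weights above combine, here is a minimal Python sketch. The function names `final_score` and `minimum_letter` are hypothetical, and the sketch makes several assumptions: each component is already summarized as a percentage from 0 to 100, "surpass 50%" means strictly greater than 50, and late penalties have already been applied to the homework and project percentages. Only the three cutoffs the syllabus spells out are encoded, and any curve is ignored.

```python
def final_score(homework_pct, project_pct, participation_pct):
    """Weighted course score on a 0-100 scale (hypothetical helper)."""
    # Assumption: participation is all-or-nothing, and "surpass 50%"
    # is read as strictly more than 50.
    participation_credit = 100.0 if participation_pct > 50 else 0.0
    return (0.30 * homework_pct
            + 0.60 * project_pct
            + 0.10 * participation_credit)


def minimum_letter(score):
    """Lowest letter grade guaranteed by the listed cutoffs, else None."""
    if score >= 93:
        return "A"
    if score >= 90:
        return "A-"
    if score >= 87:
        return "B+"
    # Lower cutoffs are covered only by "etc." in the syllabus,
    # so they are deliberately not guessed here.
    return None


if __name__ == "__main__":
    score = final_score(homework_pct=90, project_pct=95, participation_pct=60)
    print(score, minimum_letter(score))
```

Because participation is a threshold rather than a proportion, dropping from 51% to 50% on in-class activities costs the full 10 points of the final score under these assumptions.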
Acknowledgement
Many of these materials are based on materials from Prof. Vincent Dorie.