Wayne's Github Page

A place to learn about statistics

Applied Data Mining

This course will expose students to a variety of data mining applications using machine learning methods.

Students who finish this class should:

Prerequisites

Textbooks / References

Timeline

I reserve the right to change the ordering and content of the course during the semester.

| Date | Topic | Reference | Due |
| --- | --- | --- | --- |
| 2022-01-18 | Intro to data mining | Brazilian e-commerce on Kaggle; ISL Chapter 2.2 | |
| 2022-01-20 | Data mining with basic statistics and regression review | ISL Chapter 3 | Have RStudio installed; informal exploration with the Brazilian e-commerce dataset |
| 2022-01-25 | Regression review continued | ISL Chapter 3 | |
| 2022-01-27 | Logistic regression + Naive Bayes | | Homework 1 due |
| 2022-02-01 | Principal Component Analysis | ISL 6.3.1 + ISL 10.2 | |
| 2022-02-03 | Principal Component Analysis applications | ISL 6.3.1 + ISL 10.2 | |
| 2022-02-08 | Rise of machine learning and “wrong” models - some history | Paper on Why Biased Estimators given Stein Estimator + Gauss-Markov Theorem + ISL Chapter 2.2 continued | |
| 2022-02-10 | Ridge + Lasso regression | ISL 6.2 | Homework 2 due |
| 2022-02-15 | Ridge + Lasso simulations | ISL 6.2 | |
| 2022-02-17 | Tree methods | ISL 8.1 | |
| 2022-02-22 | Tree methods continued | ISL 8.1 | |
| 2022-02-24 | Trees + forests with real data | ISL 8.2; bias in random forest variable importance | Homework 3 |
| 2022-03-01 | Optimization and objective functions | Slides 5 + ISL Chapter 3.1.1 + 3.3.3 | |
| 2022-03-03 | Beyond classification accuracy | Slides 6 | |
| 2022-03-08 | Resampling techniques - accuracy vs robustness | Slides 7 + Resampling from ISL | Read paper on Stability |
| 2022-03-10 | Automated model selection | Slides 7 + caret library + ISL on resampling | Project 1 due |
| 2022-03-15 | Spring Break | | |
| 2022-03-17 | Spring Break | | |
| 2022-03-22 | Clustering - K-means | ISL 10.2 | |
| 2022-03-24 | Clustering - K-means continued | ISL 10.2 | |
| 2022-03-29 | K-means with real data | ISL 10.2 | |
| 2022-03-31 | Hierarchical clustering | ISL 10.2 | Homework 4 |
| 2022-04-05 | Hierarchical clustering with real data | ISL 10.2 | |
| 2022-04-07 | DBSCAN | DBSCAN from KDnuggets | |
| 2022-04-12 | Feature engineering - with text | Pre-processing Text + Speech and Language Chapter 6.5 | |
| 2022-04-14 | Working with text data continued | | |
| 2022-04-19 | Independent Component Analysis | Stanford ICA Slides | |
| 2022-04-21 | Models on text including Wordfish | | Homework 5 due 4/22/2022 |
| 2022-04-26 | Going over final projects in class | | |
| 2022-04-28 | Going over final projects in class + what we didn’t teach | | Project 2 due 5/7/2022 (moved from 5/2/2022) |

Logistics

Lectures: TuTh 2:40pm - 3:55pm, 503 Hamilton Hall

Teaching Team

Wayne Tai Lee (wtl2109)

Online Discussion

The TA and grader will check the online discussion for 30 minutes each weekday. Do not expect an immediate response: start your work early and post your questions as clearly as possible.

Grading

If your final grade is in [93, 100], you will earn at least an A; [90, 93) earns at least an A-, [87, 90) at least a B+, and so on. A grading curve may be applied depending on class performance, but I will not curve downwards. I may not give out any A+’s in this class.

- Homework (30%)

Acknowledgement

Many of these materials are based on materials from Prof. Vincent Dorie.