
Algorithms for Data Science (Fall 2018)


Lecturer: Hossein Hajiabolhassan
Course webpage: Algorithms for Data Science
Data Science Center, Shahid Beheshti University


Index:

  • Main Textbooks
  • Slides and Papers
    1. Lecture 1: Introduction to Data Science
    2. Lecture 2: Toolkit Lab: Jupyter Notebook
    3. Lecture 3: Toolkit Lab: Git & GitHub
    4. Lecture 4: Introduction to Data Mining
    5. Lecture 5: MapReduce and the New Software Stack
    6. Lecture 6: Link Analysis
    7. Lecture 7: Toolkit Lab: Orange & Weka
    8. Lecture 8: Representative-Based Clustering
    9. Lecture 9: Hierarchical Clustering
    10. Lecture 10: Density-Based Clustering
    11. Lecture 11: Spectral and Graph Clustering
    12. Lecture 12: Clustering Validation
    13. Lecture 13: Probabilistic Classification
    14. Lecture 14: Decision Tree Classifier
  • Class Time and Location
  • Grading
    • Two Written Exams
  • Prerequisites
    • Linear Algebra
    • Probability and Statistics
  • Account
  • Academic Honor Code
  • Questions
  • Miscellaneous
    • Data
    • Projects

Main Textbooks:


  • Mining of Massive Datasets by Jure Leskovec, Anand Rajaraman, and Jeff Ullman.
    Reading: Chapter 1, Chapter 2 (Sections: 2.1, 2.2, & 2.3), and Chapter 5

  • Data Mining and Analysis: Fundamental Concepts and Algorithms by Mohammed J. Zaki and Wagner Meira Jr.
    Reading: Chapters 13, 14, 15 (Section 15.1), 16, 17, 18, and 19

Slides and Papers

Recommended Slides & Papers:

  1. Introduction to Data Science

     Required Reading:
    
    • Slide: Introduction to Data Science by Zico Kolter
    • Slide: Introduction to Data Science by Kevin Markham
    • Paper: Analyzing the Analyzers: An Introspective Survey of Data Scientists and Their Work
  2. Toolkit Lab: Jupyter Notebook

     Required Reading:
    
    • Slide: Practical Data Science: Jupyter Notebook Lab by Zico Kolter
  3. Toolkit Lab: Git & GitHub

     Required Reading:
    
    • Slide: An Introduction to Git by Politecnico di Torino
    • Slide: GIT for Beginners by Anthony Baire
  4. Introduction to Data Mining

     Required Reading:
    
    • Chapter 1 of Mining of Massive Datasets
    • Slide: Introduction to Data Mining by U Kang
    • Slide: Bonferroni’s Principle by Irene Finocchi
  5. MapReduce and the New Software Stack

     Required Reading:
    
    • Chapter 2 of Mining of Massive Datasets
    • Slide of Sections 2.1 & 2.2 (Distributed File Systems & MapReduce): Introduction & MapReduce by Jure Leskovec
    • Slide of Section 2.3 (Algorithms Using MapReduce): Relational Algebra with MapReduce by Damiano Carra
    • Slide: MapReduce by Paul Krzyzanowski
    • Slide: Introduction to Database Systems (Relational Algebra) by Werner Nutt
  6. Link Analysis

     Required Reading:
    
    • Chapter 5 of Mining of Massive Datasets
    • Slide of Sections 5.1, 5.2 (PageRank, Efficient Computation of PageRank): Analysis of Large Graphs 1
    • Slide of Sections 5.3-5.5 (Topic-Sensitive PageRank, Link Spam, Hubs and Authorities): Analysis of Large Graphs 2
    • Slide: The Linear Algebra Aspects of PageRank by Ilse Ipsen
       Additional Reading:
      
    • Paper: A Survey on Proximity Measures for Social Networks by Sara Cohen, Benny Kimelfeld, Georgia Koutrika
  7. Toolkit Lab: Orange & Weka

     Required Reading:
    
    • Orange: Youtube Tutorial of Orange & Widget Catalog of Orange
       Additional Reading:
      
    • Weka: Data Mining with Weka
    • Free online courses on data mining with machine learning techniques in Weka. You can also register for the course via the FutureLearn education platform.
  8. Representative-Based Clustering

     Required Reading:
    
    • Chapter 13 of Data Mining & Analysis
      Exercises 13.5: Q2, Q4, Q6, Q7
    • Slides (Representative-based Clustering): PDF, PPT by Mohammed J. Zaki and Wagner Meira Jr.
    • Slide: Clustering by Matt Dickenson
    • Slide: Introduction to Machine Learning (Clustering and EM) by Barnabás Póczos & Aarti Singh
    • Tutorial: The Expectation Maximization Algorithm by Sean Borman
    • Tutorial: What is Bayesian Statistics? by John W Stevens
       Additional Reading:
      
    • Slide: Tutorial on Estimation and Multivariate Gaussians by Shubhendu Trivedi
    • Slide: Mixture Model by Jing Gao
    • Paper: Fast Exact k-Means, k-Medians and Bregman Divergence Clustering in 1D
    • Paper: k-Means Requires Exponentially Many Iterations Even in the Plane by Andrea Vattani
    • Book: Understanding Machine Learning: From Theory to Algorithms by Shai Shalev-Shwartz and Shai Ben-David
  9. Hierarchical Clustering

     Required Reading:
    
    • Chapter 14 of Data Mining & Analysis
      Exercises 14.4: Q4
    • Slides (Hierarchical Clustering): PDF, PPT by Mohammed J. Zaki and Wagner Meira Jr.
    • Slide: Hierarchical Clustering by Jonathan Taylor
    • Slide: Data Structures (Heap) by Wing-Kai Hon
       Additional Reading:
      
    • Slide: Hierarchical Clustering for Gene Expression Data Analysis by Giorgio Valentini
    • Slide: Hierarchical Clustering by Jing Gao
    • Slide: Binary Heaps
    • A Short Note: Proof for the Complexity of Building a Heap by Hu Ding
    • Lecture: Finding Meaningful Clusters in Data by Sanjoy Dasgupta
    • Paper: An Impossibility Theorem for Clustering by Jon Kleinberg
  10. Density-Based Clustering

    Required Reading:
    
    • Chapter 15 of Data Mining & Analysis
    • Slides of Section 15.1 (Density-based Clustering): PDF, PPT by Mohammed J. Zaki and Wagner Meira Jr.
    • Slide: Spatial Database Systems by Ralf Hartmut Güting
  11. Spectral and Graph Clustering

    Required Reading:
    
    • Chapter 16 of Data Mining & Analysis
      Exercises 16.5: Q2, Q3, Q6
    • Slides (Spectral and Graph Clustering): PDF, PPT by Mohammed J. Zaki and Wagner Meira Jr.
    • Slide: Spectral Clustering by Andrew Rosenberg
    • Slide: Introduction to Spectral Clustering by Vasileios Zografos and Klas Nordberg
      Additional Reading:
      
    • Slide: Spectral Methods by Jing Gao
    • Tutorial: A Tutorial on Spectral Clustering by Ulrike von Luxburg
    • Tutorial: Matrix Differentiation by Randal J. Barnes
    • Lecture: Spectral Methods by Sanjoy Dasgupta
    • Paper: Positive Semidefinite Matrices and Variational Characterizations of Eigenvalues by Wing-Kin Ma
  12. Clustering Validation

    Required Reading:
    
    • Chapter 17 of Data Mining & Analysis
    • Slides of Section 17.1 (Clustering Validation): PDF, PPT by Mohammed J. Zaki and Wagner Meira Jr.
    • Slide: Clustering Analysis by Enza Messina
    • Slide: Information Theory by Jossy Sayir
    • Slide: Normalized Mutual Information: Estimating Clustering Quality by Bilal Ahmed
      Additional Reading:
      
    • Slide: Clustering Evaluation (II) by Andrew Rosenberg
    • Slide: Evaluation (I) by Andrew Rosenberg
  13. Probabilistic Classification

    Required Reading:
    
    • Chapter 18 of Data Mining & Analysis
    • Slides (Probabilistic Classification): PDF, PPT by Mohammed J. Zaki and Wagner Meira Jr.
    • Slide: Naïve Bayes Classifier by Eamonn Keogh
      Additional Reading:
      
    • Slide: Bayes Nets for Representing and Reasoning About Uncertainty by Andrew W. Moore
    • Slide: A Tutorial on Bayesian Networks by Weng-Keen Wong
  14. Decision Tree Classifier

    Required Reading:
    
    • Chapter 19 of Data Mining & Analysis
    • Slides (Decision Tree Classifier): PDF, PPT by Mohammed J. Zaki and Wagner Meira Jr.
    • Slide: Information Gain by Linda Shapiro
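As a taste of the MapReduce programming model covered in Lecture 5, the classic word-count job can be simulated in a single Python process. This is only a minimal sketch of the map–shuffle–reduce pattern; the function names and sample documents are illustrative, and a real job would run on a distributed stack such as Hadoop:

```python
from itertools import groupby
from operator import itemgetter

def map_phase(documents):
    """Map: emit a (word, 1) pair for every word in every document."""
    for doc in documents:
        for word in doc.split():
            yield (word.lower(), 1)

def shuffle(pairs):
    """Shuffle: group the intermediate pairs by key (word)."""
    for key, group in groupby(sorted(pairs, key=itemgetter(0)), key=itemgetter(0)):
        yield key, [count for _, count in group]

def reduce_phase(grouped):
    """Reduce: sum the counts collected for each word."""
    return {key: sum(values) for key, values in grouped}

documents = ["the quick brown fox", "the lazy dog", "the fox"]
counts = reduce_phase(shuffle(map_phase(documents)))
print(counts)
```

In a real cluster the map tasks run on different chunks of the input, and the shuffle is performed by the framework, but the programmer only writes the map and reduce functions, exactly as above.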
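Similarly, the PageRank computation of Lecture 6 can be sketched as power iteration with teleporting. The toy graph, the `beta` damping value, and the function name below are illustrative assumptions, not taken from the course slides:

```python
def pagerank(graph, beta=0.85, tol=1e-10, max_iter=100):
    """Power iteration for PageRank; 1 - beta is the teleport probability."""
    nodes = sorted(graph)
    n = len(nodes)
    rank = {u: 1.0 / n for u in nodes}
    for _ in range(max_iter):
        new_rank = {u: (1.0 - beta) / n for u in nodes}  # teleport mass
        for u in nodes:
            out = graph[u]
            if out:  # distribute u's rank along its out-links
                share = beta * rank[u] / len(out)
                for v in out:
                    new_rank[v] += share
            else:  # dead end: spread u's rank uniformly
                for v in nodes:
                    new_rank[v] += beta * rank[u] / n
        converged = sum(abs(new_rank[u] - rank[u]) for u in nodes) < tol
        rank = new_rank
        if converged:
            break
    return rank

# graph[u] = list of pages that page u links to
graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
ranks = pagerank(graph)
print({u: round(r, 3) for u, r in ranks.items()})
```

Note that the ranks always sum to 1, and that page C, which receives the most in-links, ends up with the highest rank.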
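Finally, the representative-based clustering of Lecture 8 centers on k-means. The sketch below is Lloyd's algorithm on toy 2-D data; the data points and helper structure are mine, chosen only to illustrate the assignment/update alternation:

```python
import random

def kmeans(points, k, max_iter=100, seed=0):
    """Lloyd's algorithm: alternate nearest-centroid assignment and mean update."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # initialize from random data points
    for _ in range(max_iter):
        # Assignment step: attach each point to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(
                range(k),
                key=lambda i: sum((a - b) ** 2 for a, b in zip(p, centroids[i])),
            )
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster.
        new_centroids = []
        for i, cluster in enumerate(clusters):
            if cluster:
                new_centroids.append(tuple(sum(c) / len(cluster) for c in zip(*cluster)))
            else:  # empty cluster: keep the old centroid
                new_centroids.append(centroids[i])
        if new_centroids == centroids:  # converged
            break
        centroids = new_centroids
    return centroids, clusters

points = [(0.0, 0.0), (0.2, 0.1), (9.0, 9.0), (9.1, 8.8)]
centroids, clusters = kmeans(points, k=2)
print(sorted(centroids))
```

On this data the algorithm converges to one centroid per tight pair of points. Keep in mind (Lecture 8's additional reading) that k-means only finds a local optimum and can, in the worst case, take exponentially many iterations.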

Additional Slides:

  • Practical Data Science by Zico Kolter
  • Course: Data Mining by U Kang
  • Crash Course in Spark by Daniel Templeton
  • Statistical Data Mining Tutorials by Andrew W. Moore

Class Time and Location

Saturday and Monday 08:00-09:30 AM (Fall 2018), Room 208.

Grading:

  • Homework – 15%
    — Will consist of mathematical problems and/or programming assignments.
  • Midterm – 35%
  • Final Exam – 50%

Two Written Exams:

Midterm Examination: Monday 1397/09/12, 08:00-10:00
Final Examination: Sunday 1397/10/16, 08:30-10:30

Prerequisites:

General mathematical sophistication, and a solid understanding of algorithms, linear algebra, and probability theory at the advanced undergraduate or beginning graduate level, or equivalent.

Linear Algebra:

  • Video: Professor Gilbert Strang’s Video Lectures on linear algebra.

Probability and Statistics:

  • Learn Probability and Statistics Through Interactive Visualizations: Seeing Theory was created by Daniel Kunin while an undergraduate at Brown University. The goal of this website is to make statistics more accessible through interactive visualizations (designed using Mike Bostock’s JavaScript library D3.js).
  • Statistics and Probability: This website provides training and tools to help you solve statistics problems quickly, easily, and accurately - without having to ask anyone for help.
  • Jupyter Notebooks: Introduction to Statistics by Bargava
  • Video: Professor John Tsitsiklis’s Video Lectures on Applied Probability.
  • Video: Professor Krishna Jagannathan’s Video Lectures on Probability Theory.

Topics:

Have a look at some reports by Kaggle competitors or Stanford students (CS224N, CS224D) to get some general inspiration.

Account:

You must have a GitHub account to share your projects. GitHub offers free accounts as well as plans with private repositories. GitHub is like the hammer in your toolbox: you need to have it!

Academic Honor Code:

Honesty and integrity are vital elements of academic work. All your submitted assignments must be entirely your own (or your own group’s).

We will follow the standard approach of the Department of Mathematical Sciences:

  • You can get help, but you MUST acknowledge the help on the work you hand in
  • Failure to acknowledge your sources is a violation of the Honor Code
  • You can talk to others about the algorithm(s) to be used to solve a homework problem, as long as you then mention their name(s) on the work you submit
  • You should not use or look at others’ code when you write your own: you can talk to people, but you must write your own solution/code

Questions?

I will hold office hours for this course on Mondays (09:30 AM–12:00 PM). If this is not convenient, email me at hhaji@sbu.ac.ir or talk to me after class.

Algorithms-For-Data-Science is maintained by hhaji. This page was generated by GitHub Pages.