
Algorithms for Data Science (Fall 2018)


Lecturer: Hossein Hajiabolhassan
Course webpage: Algorithms for Data Science
Data Science Center, Shahid Beheshti University


Index:

  • Main Textbooks
  • Slides and Papers
    1. Lecture 1: Introduction to Data Science
    2. Lecture 2: Toolkit Lab: Jupyter Notebook
    3. Lecture 3: Toolkit Lab: Git & GitHub
    4. Lecture 4: Introduction to Data Mining
    5. Lecture 5: MapReduce and the New Software Stack
    6. Lecture 6: Link Analysis
    7. Lecture 7: Toolkit Lab: Orange & Weka
    8. Lecture 8: Representative-Based Clustering
    9. Lecture 9: Hierarchical Clustering
    10. Lecture 10: Density-Based Clustering
    11. Lecture 11: Spectral and Graph Clustering
    12. Lecture 12: Clustering Validation
    13. Lecture 13: Probabilistic Classification
    14. Lecture 14: Decision Tree Classifier
  • Class Time and Location
  • Grading
    • Two Written Exams
  • Prerequisites
    • Linear Algebra
    • Probability and Statistics
  • Account
  • Academic Honor Code
  • Questions
  • Miscellaneous
    • Data
    • Projects

Main Textbooks:


  • Mining of Massive Datasets by Jure Leskovec, Anand Rajaraman, and Jeff Ullman.
    Reading: Chapter 1, Chapter 2 (Sections: 2.1, 2.2, & 2.3), and Chapter 5

  • Data Mining and Analysis: Fundamental Concepts and Algorithms by Mohammed J. Zaki and Wagner Meira Jr.
    Reading: Chapters 13, 14, 15 (Section 15.1), 16, 17, 18, and 19

Slides and Papers

Recommended Slides & Papers:

  1. Introduction to Data Science

     Required Reading:
    
    • Slide: Introduction to Data Science by Zico Kolter
    • Slide: Introduction to Data Science by Kevin Markham
    • Paper: Analyzing the Analyzers: An Introspective Survey of Data Scientists and Their Work
  2. Toolkit Lab: Jupyter Notebook

     Required Reading:
    
    • Slide: Practical Data Science: Jupyter Notebook Lab by Zico Kolter
  3. Toolkit Lab: Git & GitHub

     Required Reading:
    
    • Slide: An Introduction to Git by Politecnico di Torino
    • Slide: GIT for Beginners by Anthony Baire
  4. Introduction to Data Mining

     Required Reading:
    
    • Chapter 1 of Mining of Massive Datasets
    • Slide: Introduction to Data Mining by U Kang
    • Slide: Bonferroni’s Principle by Irene Finocchi
  5. MapReduce and the New Software Stack

     Required Reading:
    
    • Chapter 2 of Mining of Massive Datasets
    • Slide of Sections 2.1 & 2.2 (Distributed File Systems & MapReduce): Introduction & MapReduce by Jure Leskovec
    • Slide of Section 2.3 (Algorithms Using MapReduce): Relational Algebra with MapReduce by Damiano Carra
    • Slide: MapReduce by Paul Krzyzanowski
    • Slide: Introduction to Database Systems (Relational Algebra) by Werner Nutt
  6. Link Analysis

     Required Reading:
    
    • Chapter 5 of Mining of Massive Datasets
    • Slide of Sections 5.1, 5.2 (PageRank, Efficient Computation of PageRank): Analysis of Large Graphs 1
    • Slide of Sections 5.3-5.5 (Topic-Sensitive PageRank, Link Spam, Hubs and Authorities): Analysis of Large Graphs 2
    • Slide: The Linear Algebra Aspects of PageRank by Ilse Ipsen
       Additional Reading:
      
    • Paper: A Survey on Proximity Measures for Social Networks by Sara Cohen, Benny Kimelfeld, Georgia Koutrika
  7. Toolkit Lab: Orange & Weka

     Required Reading:
    
    • Orange: Youtube Tutorial of Orange & Widget Catalog of Orange
       Additional Reading:
      
    • Weka: Data Mining with Weka
    • Free online courses on data mining with machine learning techniques in Weka. You can also register for the course via the FutureLearn education platform.
  8. Representative-Based Clustering

     Required Reading:
    
    • Chapter 13 of Data Mining & Analysis
      Exercises 13.5: Q2, Q4, Q6, Q7
    • Slides (Representative-based Clustering): PDF, PPT by Mohammed J. Zaki and Wagner Meira Jr.
    • Slide: Clustering by Matt Dickenson
    • Slide: Introduction to Machine Learning (Clustering and EM) by Barnabás Póczos & Aarti Singh
    • Tutorial: The Expectation Maximization Algorithm by Sean Borman
    • Tutorial: What is Bayesian Statistics? by John W Stevens
       Additional Reading:
      
    • Slide: Tutorial on Estimation and Multivariate Gaussians by Shubhendu Trivedi
    • Slide: Mixture Model by Jing Gao
    • Paper: Fast Exact k-Means, k-Medians and Bregman Divergence Clustering in 1D
    • Paper: k-Means Requires Exponentially Many Iterations Even in the Plane by Andrea Vattani
    • Book: Understanding Machine Learning: From Theory to Algorithms by Shai Shalev-Shwartz and Shai Ben-David
  9. Hierarchical Clustering

     Required Reading:
    
    • Chapter 14 of Data Mining & Analysis
      Exercises 14.4: Q4
    • Slides (Hierarchical Clustering): PDF, PPT by Mohammed J. Zaki and Wagner Meira Jr.
    • Slide: Hierarchical Clustering by Jonathan Taylor
    • Slide: Data Structures (Heap) by Wing-Kai Hon
       Additional Reading:
      
    • Slide: Hierarchical Clustering for Gene Expression Data Analysis by Giorgio Valentini
    • Slide: Hierarchical Clustering by Jing Gao
    • Slide: Binary Heaps
    • A Short Note: Proof for the Complexity of Building a Heap by Hu Ding
    • Lecture: Finding Meaningful Clusters in Data by Sanjoy Dasgupta
    • Paper: An Impossibility Theorem for Clustering by Jon Kleinberg
  10. Density-Based Clustering

    Required Reading:
    
    • Chapter 15 of Data Mining & Analysis
    • Slides of Section 15.1 (Density-based Clustering): PDF, PPT by Mohammed J. Zaki and Wagner Meira Jr.
    • Slide: Spatial Database Systems by Ralf Hartmut Güting
  11. Spectral and Graph Clustering

    Required Reading:
    
    • Chapter 16 of Data Mining & Analysis
      Exercises 16.5: Q2, Q3, Q6
    • Slides (Spectral and Graph Clustering): PDF, PPT by Mohammed J. Zaki and Wagner Meira Jr.
    • Slide: Spectral Clustering by Andrew Rosenberg
    • Slide: Introduction to Spectral Clustering by Vasileios Zografos and Klas Nordberg
      Additional Reading:
      
    • Slide: Spectral Methods by Jing Gao
    • Tutorial: A Tutorial on Spectral Clustering by Ulrike von Luxburg
    • Tutorial: Matrix Differentiation by Randal J. Barnes
    • Lecture: Spectral Methods by Sanjoy Dasgupta
    • Paper: Positive Semidefinite Matrices and Variational Characterizations of Eigenvalues by Wing-Kin Ma
  12. Clustering Validation

    Required Reading:
    
    • Chapter 17 of Data Mining & Analysis
    • Slides of Section 17.1 (Clustering Validation): PDF, PPT by Mohammed J. Zaki and Wagner Meira Jr.
    • Slide: Clustering Analysis by Enza Messina
    • Slide: Information Theory by Jossy Sayir
    • Slide: Normalized Mutual Information: Estimating Clustering Quality by Bilal Ahmed
      Additional Reading:
      
    • Slide: Clustering Evaluation (II) by Andrew Rosenberg
    • Slide: Evaluation (I) by Andrew Rosenberg
  13. Probabilistic Classification

    Required Reading:
    
    • Chapter 18 of Data Mining & Analysis
    • Slides (Probabilistic Classification): PDF, PPT by Mohammed J. Zaki and Wagner Meira Jr.
    • Slide: Naïve Bayes Classifier by Eamonn Keogh
      Additional Reading:
      
    • Slide: Bayes Nets for Representing and Reasoning About Uncertainty by Andrew W. Moore
    • Slide: A Tutorial on Bayesian Networks by Weng-Keen Wong
  14. Decision Tree Classifier

    Required Reading:
    
    • Chapter 19 of Data Mining & Analysis
    • Slides (Decision Tree Classifier): PDF, PPT by Mohammed J. Zaki and Wagner Meira Jr.
    • Slide: Information Gain by Linda Shapiro
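As a taste of the MapReduce programming model covered in Lecture 5, the classic word-count job can be simulated in a single Python process. This is only a minimal sketch of the map–shuffle–reduce pattern; the function names and sample documents are illustrative, and a real job would run on a distributed stack such as Hadoop:

```python
from itertools import groupby
from operator import itemgetter

def map_phase(documents):
    """Map: emit a (word, 1) pair for every word in every document."""
    for doc in documents:
        for word in doc.split():
            yield (word.lower(), 1)

def shuffle(pairs):
    """Shuffle: group the intermediate pairs by key (word)."""
    for key, group in groupby(sorted(pairs, key=itemgetter(0)), key=itemgetter(0)):
        yield key, [count for _, count in group]

def reduce_phase(grouped):
    """Reduce: sum the counts collected for each word."""
    return {key: sum(values) for key, values in grouped}

documents = ["the quick brown fox", "the lazy dog", "the fox"]
counts = reduce_phase(shuffle(map_phase(documents)))
print(counts)
```

In a real cluster the map tasks run on different chunks of the input, and the shuffle is performed by the framework, but the programmer only writes the map and reduce functions, exactly as above.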
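Similarly, the PageRank computation of Lecture 6 can be sketched as power iteration with teleporting. The toy graph, the `beta` damping value, and the function name below are illustrative assumptions, not taken from the course slides:

```python
def pagerank(graph, beta=0.85, tol=1e-10, max_iter=100):
    """Power iteration for PageRank; 1 - beta is the teleport probability."""
    nodes = sorted(graph)
    n = len(nodes)
    rank = {u: 1.0 / n for u in nodes}
    for _ in range(max_iter):
        new_rank = {u: (1.0 - beta) / n for u in nodes}  # teleport mass
        for u in nodes:
            out = graph[u]
            if out:  # distribute u's rank along its out-links
                share = beta * rank[u] / len(out)
                for v in out:
                    new_rank[v] += share
            else:  # dead end: spread u's rank uniformly
                for v in nodes:
                    new_rank[v] += beta * rank[u] / n
        converged = sum(abs(new_rank[u] - rank[u]) for u in nodes) < tol
        rank = new_rank
        if converged:
            break
    return rank

# graph[u] = list of pages that page u links to
graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
ranks = pagerank(graph)
print({u: round(r, 3) for u, r in ranks.items()})
```

Note that the ranks always sum to 1, and that page C, which receives the most in-links, ends up with the highest rank.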
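Finally, the representative-based clustering of Lecture 8 centers on k-means. The sketch below is Lloyd's algorithm on toy 2-D data; the data points and helper structure are mine, chosen only to illustrate the assignment/update alternation:

```python
import random

def kmeans(points, k, max_iter=100, seed=0):
    """Lloyd's algorithm: alternate nearest-centroid assignment and mean update."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # initialize from random data points
    for _ in range(max_iter):
        # Assignment step: attach each point to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(
                range(k),
                key=lambda i: sum((a - b) ** 2 for a, b in zip(p, centroids[i])),
            )
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster.
        new_centroids = []
        for i, cluster in enumerate(clusters):
            if cluster:
                new_centroids.append(tuple(sum(c) / len(cluster) for c in zip(*cluster)))
            else:  # empty cluster: keep the old centroid
                new_centroids.append(centroids[i])
        if new_centroids == centroids:  # converged
            break
        centroids = new_centroids
    return centroids, clusters

points = [(0.0, 0.0), (0.2, 0.1), (9.0, 9.0), (9.1, 8.8)]
centroids, clusters = kmeans(points, k=2)
print(sorted(centroids))
```

On this data the algorithm converges to one centroid per tight pair of points. Keep in mind (Lecture 8's additional reading) that k-means only finds a local optimum and can, in the worst case, take exponentially many iterations.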

Additional Slides:

  • Practical Data Science by Zico Kolter
  • Course: Data Mining by U Kang
  • Crash Course in Spark by Daniel Templeton
  • Statistical Data Mining Tutorials by Andrew W. Moore

Class Time and Location

Saturday and Monday 08:00-09:30 AM (Fall 2018), Room 208.

Grading:

  • Homework – 15%
    — Will consist of mathematical problems and/or programming assignments.
  • Midterm – 35%
  • Final Exam – 50%

Two Written Exams:

Midterm Examination: Monday 1397/09/12, 08:00-10:00
Final Examination: Sunday 1397/10/16, 08:30-10:30

Prerequisites:

General mathematical sophistication, and a solid understanding of algorithms, linear algebra, and probability theory at the advanced undergraduate or beginning graduate level, or equivalent.

Linear Algebra:

  • Video: Professor Gilbert Strang’s Video Lectures on linear algebra.

Probability and Statistics:

  • Learn Probability and Statistics Through Interactive Visualizations: Seeing Theory was created by Daniel Kunin while an undergraduate at Brown University. The goal of this website is to make statistics more accessible through interactive visualizations (designed using Mike Bostock’s JavaScript library D3.js).
  • Statistics and Probability: This website provides training and tools to help you solve statistics problems quickly, easily, and accurately - without having to ask anyone for help.
  • Jupyter Notebooks: Introduction to Statistics by Bargava
  • Video: Professor John Tsitsiklis’s Video Lectures on Applied Probability.
  • Video: Professor Krishna Jagannathan’s Video Lectures on Probability Theory.

Topics:

Have a look at some reports by Kaggle competitors or Stanford students (CS224N, CS224D) to get some general inspiration.

Account:

You must have a GitHub account to share your projects. GitHub offers free accounts as well as plans with private repositories. GitHub is like the hammer in your toolbox: you need to have it!

Academic Honor Code:

Honesty and integrity are vital elements of academic work. All your submitted assignments must be entirely your own (or your own group’s).

We will follow the standard approach of the Department of Mathematical Sciences:

  • You can get help, but you MUST acknowledge the help on the work you hand in
  • Failure to acknowledge your sources is a violation of the Honor Code
  • You can talk to others about the algorithm(s) to be used to solve a homework problem, as long as you then mention their name(s) on the work you submit
  • You should not use or look at others’ code when you write your own: you can talk to people, but you must write your own solution/code

Questions?

I will hold office hours for this course on Mondays (09:30 AM–12:00 PM). If this is not convenient, email me at hhaji@sbu.ac.ir or talk to me after class.

Algorithms-For-Data-Science is maintained by hhaji. This page was generated by GitHub Pages.