
Deep Learning


Data Handling

This page collects blog posts on data and data handling, along with several dataset resources.

  • Improve Your Data Literacy Skills and Make the Most of Data by Geckoboard
    • Tips for Effective Data Visualization
    • Common Data Mistakes to Avoid (poster): Cherry Picking, Data Dredging, Survivorship Bias, Cobra Effect, False Causality, Gerrymandering, Sampling Bias, Gambler’s Fallacy, Hawthorne Effect, Regression Toward the Mean, Simpson’s Paradox, McNamara Fallacy, Overfitting, Publication Bias, and the Danger of Summary Metrics

Dealing with Data

  • Slide: Data Preparation by João Mendes Moreira and José Luís Borges
  • Slide: Data Preprocessing by Taehyung Wang
  • Slide: Learning with Missing Labels by Vivek Srikumar
  • Slide: Data Cleaning and Data Preprocessing by Nguyen Hung Son
  • Blog: Applying Wrapper Methods in Python for Feature Selection by Usman Malik
  • Blog: Basics of Feature Selection with Python by Andika Rachman
  • Blog: Exhaustive Feature Selector by Sebastian Raschka
  • Blog: Need for Feature Engineering in Machine Learning by Ashish Bansal
  • Blog: Data Preprocessing by Zdravko Markov
  • Blog: How to Handle Correlated Features? by Reinhard Sellmair
  • Blog: 5 Ways To Handle Missing Values In Machine Learning Datasets
  • Blog: Handling Missing Data
  • Blog: How to Handle Missing Data
  • Blog: 7 Techniques to Handle Imbalanced Data
  • Blog: Application of Synthetic Minority Over-sampling Technique (SMOTe) for Imbalanced Datasets by Navoneel Chakrabarty (a short code sketch of imputation and SMOTE appears after this list)
  • Paper: SMOTE: Synthetic Minority Over-sampling Technique by Nitesh V. Chawla, Kevin W. Bowyer, Lawrence O. Hall, and W. Philip Kegelmeyer
  • Blog: How to Handle Imbalanced Data: An Overview
  • Blog: Visualize Missing Data with VIM Package
  • Blog: Ultimate Guide to Handle Big Datasets for Machine Learning Using Dask (in Python)
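
To make the ideas above concrete, here is a minimal sketch, not taken from any of the posts listed, that imputes missing values with scikit-learn and then rebalances the classes with SMOTE from the imbalanced-learn package; the toy dataset and parameter choices are illustrative assumptions.

```python
# Minimal sketch: impute missing values, then oversample the minority class with SMOTE.
# Assumes scikit-learn and imbalanced-learn (`pip install imbalanced-learn`) are available.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.impute import SimpleImputer
from imblearn.over_sampling import SMOTE

# Build a small imbalanced toy dataset and knock out a few entries to simulate missing data.
X, y = make_classification(n_samples=200, n_features=5, weights=[0.9, 0.1], random_state=0)
X[::17, 2] = np.nan

# 1) Missing values: replace NaNs with the column mean.
X_imputed = SimpleImputer(strategy="mean").fit_transform(X)

# 2) Class imbalance: synthesize new minority-class samples with SMOTE.
X_resampled, y_resampled = SMOTE(random_state=0).fit_resample(X_imputed, y)

print("class counts before:", np.bincount(y))
print("class counts after: ", np.bincount(y_resampled))
```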

Datasets

The following resources may be helpful for those still undecided about their course projects.

  • VisualData: Discover computer vision datasets
  • OpenML: An open science platform for machine learning
  • Open Datasets: A list of links to publicly available datasets for a variety of domains.
  • DataHub has a lot of structured data in formats such as RDF and CSV.
  • Datasets for Machine Learning
  • UC Irvine Machine Learning Repository
  • Kaggle Datasets
  • Awesome Public Datasets
  • CrowdFlower Data for Everyone library
  • Stanford Large Network Dataset Collection
  • Data Science Weekly
  • Awesome Data Science
  • Get Financial Data Directly Into R
  • Listen Data from the Green Bank Telescope
  • Cafebazaar
  • 25 Open Datasets for Deep Learning Every Data Scientist Must Work With by Pranav Dar
  • DigiKala (Persian)
  • Statistical Center of Iran
    • Dataset (Persian)

For more datasets, see the following KDnuggets page:

  • Datasets for Data Mining and Data Science

Datasets of Molecules and Their Properties

  • Blog: MoleculeNet is a benchmark specially designed for testing machine learning methods on molecular properties. To facilitate the development of molecular machine learning methods, this work curates a number of dataset collections and provides a suite of software implementing many known featurizations and previously proposed algorithms. All methods and datasets are integrated into the open-source DeepChem package (MIT license). A short loading sketch appears after this list.
  • Blog: ChEMBL is a manually curated database of bioactive molecules with drug-like properties. It brings together chemical, bioactivity and genomic data to aid the translation of genomic information into effective new drugs.
  • Blog: Tox21: The 2014 Tox21 data challenge was designed to help scientists understand the potential of the chemicals and compounds being tested through the Toxicology in the 21st Century initiative to disrupt biological pathways in ways that may result in toxic effects. The Tox21 Program (Toxicology in the 21st Century) is an ongoing collaboration among federal agencies to characterize the potential toxicity of chemicals using cells and isolated molecular targets instead of laboratory animals.
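
As a hedged illustration of how these collections can be accessed, the sketch below loads the Tox21 benchmark through DeepChem's MoleculeNet loaders; the loader name and return values follow DeepChem's documented API, but treat the details (featurizer choice, split handling) as assumptions rather than a definitive recipe.

```python
# Minimal sketch: load a MoleculeNet benchmark (Tox21) with DeepChem.
# Assumes `pip install deepchem`; the featurizer choice is illustrative.
import deepchem as dc

# load_tox21 returns the task names, the (train, valid, test) splits,
# and the transformers that were applied to the data.
tasks, (train, valid, test), transformers = dc.molnet.load_tox21(featurizer="ECFP")

print(len(tasks), "prediction tasks")
print(train.X.shape, "training feature matrix")
```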

Datasets of Graphs

  • Blog: Network Repository. An Interactive Scientific Network Data Repository: The first interactive data and network data repository with real-time visual analytics. Network repository is not only the first interactive repository, but also the largest network repository with thousands of donations in 30+ domains (from biological to social network data). This repository was made by Ryan A. Rossi and Nesreen K. Ahmed.
  • Blog: Graph Classification: The mission of Papers With Code is to create a free and open resource with Machine Learning papers, code and evaluation tables.
  • Blog: Graph Challenge Data Sets: Amazon is making the Graph Challenge data sets available to the community free of charge as part of the AWS Public Data Sets program. The data is being presented in several file formats, and there are a variety of ways to access it.
  • Blog: The House of Graphs: a database of interesting graphs by G. Brinkmann, K. Coolsaet, J. Goedgebeur, and H. Mélot (also see Discrete Applied Mathematics, 161(1-2): 311-314, 2013 (DOI)).
    • Search for Graphs
  • Blog: A Repository of Benchmark Graph Datasets for Graph Classification by Shirui Pan
  • Blog: Collection and Streaming of Graph Datasets by Yibo Yao
  • Blog: Big Graph Data Sets by Yongming Luo
  • Blog: MIVIA LDGraphs Dataset: The MIVIA LDGraphs (MIVIA Large Dense Graphs) dataset is a new dataset for benchmarking exact graph matching algorithms. It extends the MIVIA graphs dataset, widely used over the last ten years, with larger and denser graphs, so as to address the problems now encountered in real applications such as bioinformatics and social network analysis.
  • Blog: Datasets by Marion Neumann
  • Blog: Graph Dataset by Xiao Meng
  • Blog: Constructors and Databases of Graphs in Sage
  • Datasets in GitHub:
    • Benchmark Dataset for Graph Classification by Filippo Bianchi: datasets for quickly testing graph classification algorithms such as graph kernels and graph neural networks (a loading sketch appears after this list).
    • GAM: A PyTorch implementation of “Graph Classification Using Structural Attention” (KDD 2018) by Benedek Rozemberczki.
    • CapsGNN: A PyTorch implementation of “Capsule Graph Neural Network” (ICLR 2019) by Benedek Rozemberczki.
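
As one way to experiment with such benchmarks, the sketch below loads a standard graph-classification dataset with the PyTorch Geometric library; the dataset name (MUTAG) and API calls are assumptions based on PyTorch Geometric's TUDataset loader and are not part of the repositories listed above.

```python
# Minimal sketch: load a graph-classification benchmark with PyTorch Geometric.
# Assumes torch and torch_geometric are installed; MUTAG is only an example dataset.
from torch_geometric.datasets import TUDataset
from torch_geometric.loader import DataLoader  # older versions: torch_geometric.data.DataLoader

dataset = TUDataset(root="data/TUDataset", name="MUTAG")
print(dataset)        # number of graphs in the collection
print(dataset[0])     # one graph: node features, edge index, and label

# Mini-batching for a GNN: graphs in a batch are merged into one large disconnected graph.
loader = DataLoader(dataset, batch_size=32, shuffle=True)
batch = next(iter(loader))
print(batch.num_graphs, "graphs in the first batch")
```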

Tools for Creating Graphs

  • Package: NetworkX: a Python package for the creation, manipulation, and study of the structure, dynamics, and functions of complex networks (a short usage sketch follows this list).
    • Graph Generators
    • Converting to and from Other Data Formats to a NetworkX Graph
    • Reading and Writing Graphs
  • Package: Sage: a viable free open source alternative to Magma, Maple, Mathematica and Matlab.
    • CoCalc: an online service for running SageMath computations without installing Sage locally. CoCalc lets you work with multiple persistent worksheets in Sage, IPython, LaTeX, and much more!
    • Graph Theory in Sage
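
Below is a short NetworkX usage sketch covering a built-in graph generator, basic manipulation, and reading/writing a graph; the file name and format choice are illustrative assumptions.

```python
# Minimal sketch: create, inspect, and round-trip a graph with NetworkX.
import networkx as nx

# Start from a built-in generator, then add a node and an edge by hand.
G = nx.karate_club_graph()          # classic 34-node social network
G.add_node("new_member")
G.add_edge("new_member", 0)

print(G.number_of_nodes(), "nodes,", G.number_of_edges(), "edges")
print("degree of node 0:", G.degree[0])

# Write the graph to disk and read it back (GraphML is one of several supported formats).
nx.write_graphml(G, "karate.graphml")
H = nx.read_graphml("karate.graphml")
print(H.number_of_nodes(), "nodes after round trip")
```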

Data Science Competition Platforms

  • Kaggle
    • Kaggle Competition Past Solutions
    • Kaggle Past Solutions by Eliot Andres
    • The Tips and Tricks I Used to Succeed on Kaggle by Vik Paruchuri
  • DrivenData
  • TunedIT
  • InnoCentive
  • CrowdAnalytix