
Deep Learning


Data Handling

This page collects blog posts on data and data handling, along with several dataset resources.

  • Improve Your Data Literacy Skills and Make the Most of Data by Geckoboard
    • Tips for Effective Data Visualization
    • Common Data Mistakes to Avoid (poster): Cherry Picking, Data Dredging, Survivorship Bias, Cobra Effect, False Causality, Gerrymandering, Sampling Bias, Gambler’s Fallacy, Hawthorne Effect, Regression Toward the Mean, Simpson’s Paradox, McNamara Fallacy, Overfitting, Publication Bias, and the Danger of Summary Metrics

Dealing with Data

  • Slide: Data Preparation by João Mendes Moreira and José Luís Borges
  • Slide: Data Preprocessing by Taehyung Wang
  • Slide: Learning with Missing Labels by Vivek Srikumar
  • Slide: Data Cleaning and Data Preprocessing by Nguyen Hung Son
  • Blog: Applying Wrapper Methods in Python for Feature Selection by Usman Malik
  • Blog: Basics of Feature Selection with Python by Andika Rachman
  • Blog: Exhaustive Feature Selector by Sebastian Raschka
  • Blog: Need for Feature Engineering in Machine Learning by Ashish Bansal
  • Blog: Data Preprocessing by Zdravko Markov
  • Blog: How to Handle Correlated Features? by Reinhard Sellmair
  • Blog: 5 Ways To Handle Missing Values In Machine Learning Datasets
  • Blog: Handling Missing Data
  • Blog: How to Handle Missing Data
  • Blog: 7 Techniques to Handle Imbalanced Data
  • Blog: Application of Synthetic Minority Over-sampling Technique (SMOTe) for Imbalanced Datasets by Navoneel Chakrabarty (a short code sketch of imputation and SMOTE appears after this list)
  • Paper: SMOTE: Synthetic Minority Over-sampling Technique by Nitesh V. Chawla, Kevin W. Bowyer, Lawrence O. Hall, and W. Philip Kegelmeyer
  • Blog: How to Handle Imbalanced Data: An Overview
  • Blog: Visualize Missing Data with VIM Package
  • Blog: Ultimate Guide to Handle Big Datasets for Machine Learning Using Dask (in Python)
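
To make the ideas above concrete, here is a minimal sketch, not taken from any of the posts listed, that imputes missing values with scikit-learn and then rebalances the classes with SMOTE from the imbalanced-learn package; the toy dataset and parameter choices are illustrative assumptions.

```python
# Minimal sketch: impute missing values, then oversample the minority class with SMOTE.
# Assumes scikit-learn and imbalanced-learn (`pip install imbalanced-learn`) are available.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.impute import SimpleImputer
from imblearn.over_sampling import SMOTE

# Build a small imbalanced toy dataset and knock out a few entries to simulate missing data.
X, y = make_classification(n_samples=200, n_features=5, weights=[0.9, 0.1], random_state=0)
X[::17, 2] = np.nan

# 1) Missing values: replace NaNs with the column mean.
X_imputed = SimpleImputer(strategy="mean").fit_transform(X)

# 2) Class imbalance: synthesize new minority-class samples with SMOTE.
X_resampled, y_resampled = SMOTE(random_state=0).fit_resample(X_imputed, y)

print("class counts before:", np.bincount(y))
print("class counts after: ", np.bincount(y_resampled))
```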

Datasets

The following resources may be helpful for those still undecided about their course projects.

  • VisualData: Discover computer vision datasets
  • OpenML: An open science platform for machine learning
  • Open Datasets: A list of links to publicly available datasets for a variety of domains.
  • DataHub has a lot of structured data in formats such as RDF and CSV.
  • Datasets for Machine Learning
  • UC Irvine Machine Learning Repository
  • Kaggle Datasets
  • Awesome Public Datasets
  • CrowdFlower Data for Everyone library
  • Stanford Large Network Dataset Collection
  • Data Science Weekly
  • Awesome Data Science
  • Get Financial Data Directly Into R
  • Listen Data from the Green Bank Telescope
  • Cafebazaar
  • 25 Open Datasets for Deep Learning Every Data Scientist Must Work With by Pranav Dar
  • DigiKala (Persian)
  • Statistical Center of Iran
    • Dataset (Persian)

For more datasets, see the following KDnuggets page:

  • Datasets for Data Mining and Data Science

Datasets of Molecules and Their Properties

  • Blog: MoleculeNet is a benchmark specially designed for testing machine learning methods on molecular properties. To facilitate the development of molecular machine learning methods, this work curates a number of dataset collections and provides a suite of software implementing many known featurizations and previously proposed algorithms. All methods and datasets are integrated into the open-source DeepChem package (MIT license). A short loading sketch appears after this list.
  • Blog: ChEMBL is a manually curated database of bioactive molecules with drug-like properties. It brings together chemical, bioactivity and genomic data to aid the translation of genomic information into effective new drugs.
  • Blog: Tox21: The 2014 Tox21 data challenge was designed to help scientists understand the potential of the chemicals and compounds being tested through the Toxicology in the 21st Century initiative to disrupt biological pathways in ways that may result in toxic effects. The Tox21 Program (Toxicology in the 21st Century) is an ongoing collaboration among federal agencies to characterize the potential toxicity of chemicals using cells and isolated molecular targets instead of laboratory animals.
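
As a hedged illustration of how these collections can be accessed, the sketch below loads the Tox21 benchmark through DeepChem's MoleculeNet loaders; the loader name and return values follow DeepChem's documented API, but treat the details (featurizer choice, split handling) as assumptions rather than a definitive recipe.

```python
# Minimal sketch: load a MoleculeNet benchmark (Tox21) with DeepChem.
# Assumes `pip install deepchem`; the featurizer choice is illustrative.
import deepchem as dc

# load_tox21 returns the task names, the (train, valid, test) splits,
# and the transformers that were applied to the data.
tasks, (train, valid, test), transformers = dc.molnet.load_tox21(featurizer="ECFP")

print(len(tasks), "prediction tasks")
print(train.X.shape, "training feature matrix")
```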

Datasets of Graphs

  • Blog: Network Repository. An Interactive Scientific Network Data Repository: The first interactive data and network data repository with real-time visual analytics. Network repository is not only the first interactive repository, but also the largest network repository with thousands of donations in 30+ domains (from biological to social network data). This repository was made by Ryan A. Rossi and Nesreen K. Ahmed.
  • Blog: Graph Classification: The mission of Papers With Code is to create a free and open resource with Machine Learning papers, code and evaluation tables.
  • Blog: Graph Challenge Data Sets: Amazon is making the Graph Challenge data sets available to the community free of charge as part of the AWS Public Data Sets program. The data is being presented in several file formats, and there are a variety of ways to access it.
  • Blog: The House of Graphs: a database of interesting graphs by G. Brinkmann, K. Coolsaet, J. Goedgebeur, and H. Mélot (also see Discrete Applied Mathematics, 161(1-2): 311-314, 2013 (DOI)).
    • Search for Graphs
  • Blog: A Repository of Benchmark Graph Datasets for Graph Classification by Shirui Pan
  • Blog: Collection and Streaming of Graph Datasets by Yibo Yao
  • Blog: Big Graph Data Sets by Yongming Luo
  • Blog: MIVIA LDGraphs Dataset: The MIVIA LDGraphs (MIVIA Large Dense Graphs) dataset is a new dataset for benchmarking exact graph matching algorithms. It extends the MIVIA graphs dataset, widely used over the last ten years, with larger and denser graphs, so as to address the problems now encountered in real applications such as bioinformatics and social network analysis.
  • Blog: Datasets by Marion Neumann
  • Blog: Graph Dataset by Xiao Meng
  • Blog: Constructors and Databases of Graphs in Sage
  • Datasets in GitHub:
    • Benchmark Dataset for Graph Classification by Filippo Bianchi: datasets for quickly testing graph classification algorithms such as graph kernels and graph neural networks (a loading sketch appears after this list).
    • GAM: A PyTorch implementation of “Graph Classification Using Structural Attention” (KDD 2018) by Benedek Rozemberczki.
    • CapsGNN: A PyTorch implementation of “Capsule Graph Neural Network” (ICLR 2019) by Benedek Rozemberczki.
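
As one way to experiment with such benchmarks, the sketch below loads a standard graph-classification dataset with the PyTorch Geometric library; the dataset name (MUTAG) and API calls are assumptions based on PyTorch Geometric's TUDataset loader and are not part of the repositories listed above.

```python
# Minimal sketch: load a graph-classification benchmark with PyTorch Geometric.
# Assumes torch and torch_geometric are installed; MUTAG is only an example dataset.
from torch_geometric.datasets import TUDataset
from torch_geometric.loader import DataLoader  # older versions: torch_geometric.data.DataLoader

dataset = TUDataset(root="data/TUDataset", name="MUTAG")
print(dataset)        # number of graphs in the collection
print(dataset[0])     # one graph: node features, edge index, and label

# Mini-batching for a GNN: graphs in a batch are merged into one large disconnected graph.
loader = DataLoader(dataset, batch_size=32, shuffle=True)
batch = next(iter(loader))
print(batch.num_graphs, "graphs in the first batch")
```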

Tools for Creating Graphs

  • Package: NetworkX: a Python package for the creation, manipulation, and study of the structure, dynamics, and functions of complex networks (a short usage sketch follows this list).
    • Graph Generators
    • Converting to and from Other Data Formats to a NetworkX Graph
    • Reading and Writing Graphs
  • Package: Sage: a viable free open source alternative to Magma, Maple, Mathematica and Matlab.
    • CoCalc: an online service for running SageMath computations without installing Sage locally. CoCalc lets you work with multiple persistent worksheets in Sage, IPython, LaTeX, and much more!
    • Graph Theory in Sage
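
Below is a short NetworkX usage sketch covering a built-in graph generator, basic manipulation, and reading/writing a graph; the file name and format choice are illustrative assumptions.

```python
# Minimal sketch: create, inspect, and round-trip a graph with NetworkX.
import networkx as nx

# Start from a built-in generator, then add a node and an edge by hand.
G = nx.karate_club_graph()          # classic 34-node social network
G.add_node("new_member")
G.add_edge("new_member", 0)

print(G.number_of_nodes(), "nodes,", G.number_of_edges(), "edges")
print("degree of node 0:", G.degree[0])

# Write the graph to disk and read it back (GraphML is one of several supported formats).
nx.write_graphml(G, "karate.graphml")
H = nx.read_graphml("karate.graphml")
print(H.number_of_nodes(), "nodes after round trip")
```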

Data Science Competition Platforms

  • Kaggle
    • Kaggle Competition Past Solutions
    • Kaggle Past Solutions by Eliot Andres
    • The Tips and Tricks I Used to Succeed on Kaggle by Vik Paruchuri
  • DrivenData
  • TunedIT
  • InnoCentive
  • CrowdAnalytix