Data Handling
This page collects several blog posts on data and data handling, along with pointers to dataset resources.
Dealing with Data
Datasets
The following resources may be helpful for those still undecided about their course projects.
For more datasets, refer to the KDnuggets datasets webpage.
### Datasets of Molecules and Their Properties
- Blog: MoleculeNet is a benchmark designed for testing machine learning methods on molecular properties. To facilitate the development of molecular machine learning methods, it curates a number of dataset collections and provides a suite of software that implements many known featurizations and previously proposed algorithms. All methods and datasets are integrated into the open-source DeepChem package (MIT license).
- Blog: ChEMBL is a manually curated database of bioactive molecules with drug-like properties. It brings together chemical, bioactivity and genomic data to aid the translation of genomic information into effective new drugs.
- Blog: Tox21: The 2014 Tox21 data challenge was designed to help scientists understand the potential of the chemicals and compounds being tested through the Toxicology in the 21st Century initiative to disrupt biological pathways in ways that may result in toxic effects. The Tox21 Program (Toxicology in the 21st Century) is an ongoing collaboration among federal agencies to characterize the potential toxicity of chemicals using cells and isolated molecular targets instead of laboratory animals.
### Datasets of Graphs
- Blog: Network Repository, an interactive scientific network data repository. It describes itself as the first interactive data and network data repository with real-time visual analytics, and as the largest network repository, with thousands of donations in 30+ domains (from biological to social network data). The repository was created by Ryan A. Rossi and Nesreen K. Ahmed.
- Blog: Graph Classification, hosted by Papers With Code, whose mission is to create a free and open resource with machine learning papers, code, and evaluation tables.
- Blog: Graph Challenge Data Sets: Amazon is making the Graph Challenge data sets available to the community free of charge as part of the AWS Public Data Sets program. The data is being presented in several file formats, and there are a variety of ways to access it.
- Blog: The House of Graphs: a database of interesting graphs by G. Brinkmann, K. Coolsaet, J. Goedgebeur, and H. Mélot (also see Discrete Applied Mathematics, 161(1-2): 311-314, 2013 (DOI)).
- Blog: A Repository of Benchmark Graph Datasets for Graph Classification by Shiruipan
- Blog: Collection and Streaming of Graph Datasets by Yibo Yao
- Blog: Big Graph Data Sets by Yongming Luo
- Blog: MIVIA LDGraphs Dataset: The MIVIA LDGraphs (MIVIA Large Dense Graphs) dataset is a new dataset for benchmarking exact graph matching algorithms. It extends the MIVIA graphs dataset, widely used over the last ten years, with bigger and denser graphs, in order to address the problems now encountered in real applications such as bioinformatics and social network analysis.
- Blog: Datasets by Marion Neumann
- Blog: Graph Dataset by Xiao Meng
- Blog: Constructors and Databases of Graphs in Sage
- Datasets on GitHub:
  - Benchmark Dataset for Graph Classification by Filippo Bianchi: this repository contains datasets for quickly testing graph classification algorithms, such as graph kernels and graph neural networks.
  - GAM by Benedek Rozemberczki: a PyTorch implementation of “Graph Classification Using Structural Attention” (KDD 2018).
  - CapsGNN by Benedek Rozemberczki: a PyTorch implementation of “Capsule Graph Neural Network” (ICLR 2019).
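
Several of the resources above distribute graphs as plain-text files. As a rough sketch only, the snippet below reads one common layout, a whitespace- or comma-separated edge list with one edge per line, into an adjacency map using the Python standard library. The function name `read_edge_list`, the comment prefixes `#` and `%`, and the toy data are assumptions made for illustration; each repository documents its own formats, so check them before reusing this.

```python
import io
from collections import defaultdict


def read_edge_list(lines, comment_prefixes=("#", "%")):
    """Build an undirected adjacency map from edge-list lines (hypothetical helper)."""
    adjacency = defaultdict(set)
    for raw in lines:
        line = raw.strip()
        # Skip blank lines and comment/header lines.
        if not line or line.startswith(comment_prefixes):
            continue
        # Accept either whitespace or commas as the field separator.
        fields = line.replace(",", " ").split()
        if len(fields) < 2:
            continue  # not an edge record
        u, v = fields[0], fields[1]  # extra columns (e.g. weights) are ignored
        adjacency[u].add(v)
        adjacency[v].add(u)
    return dict(adjacency)


# Toy data standing in for a downloaded file (not taken from any real dataset).
sample = io.StringIO("""\
# toy graph
1 2
2 3
3 1
3 4
""")
graph = read_edge_list(sample)
# Each undirected edge is stored twice; this count assumes no self-loops.
num_edges = sum(len(neighbors) for neighbors in graph.values()) // 2
print(len(graph), "nodes,", num_edges, "edges")
```

For a real file, passing an open file handle (for example `open(path)`) in place of `sample` should work the same way, since the function only iterates over lines.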
### Tools for Creating Graphs
- Package: NetworkX: a Python package for the creation, manipulation, and study of the structure, dynamics, and functions of complex networks.
- Package: Sage: a free, open-source alternative to Magma, Maple, Mathematica, and MATLAB.
- CoCalc: an online service for running SageMath computations, so you do not need to install Sage yourself. CoCalc lets you work with multiple persistent worksheets in Sage, IPython, LaTeX, and more.
- Graph Theory in Sage
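
To give a feel for what creating and studying a graph looks like in practice, here is a minimal NetworkX sketch. It assumes the `networkx` package is installed; the node labels and edges are invented purely for illustration.

```python
import networkx as nx

# Build a small undirected graph from a list of edges (toy data).
G = nx.Graph()
G.add_edges_from([("a", "b"), ("b", "c"), ("c", "d"), ("a", "c")])

# Basic size statistics.
print(G.number_of_nodes(), "nodes,", G.number_of_edges(), "edges")

# Degree of every node, as a plain dictionary.
degrees = dict(G.degree())
print(degrees)

# A shortest path between two nodes (unweighted, so fewest hops).
path = nx.shortest_path(G, "a", "d")
print(path)

# Whether every node can reach every other node.
connected = nx.is_connected(G)
print(connected)
```

Swapping `nx.Graph()` for `nx.DiGraph()` would give a directed graph instead. In that case `nx.is_connected` no longer applies and a directed notion such as `nx.is_weakly_connected` would be needed.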