The Webpage of This Repository: Tools in Data Science
Data Science Center, Shahid Beheshti University
In this repository, we introduce videos, slides, notebooks, and papers about some of the
important tools in data science, as well as some tools for writing or sharing your projects.
Index:
Command Line:
Anaconda:
Anaconda Distribution: With over 6 million users, the open source Anaconda Distribution is the fastest and easiest way to do Python and R data science and machine learning on Linux, Windows, and Mac OS X. It’s the industry standard for developing, testing, and training on a single machine.
Integrated Development Environment:
Python IDEs and Code Editors (Guide) by Jon Fincher
- IDE: An IDE (or Integrated Development Environment) is a program dedicated to software development. As the name implies, IDEs integrate several tools specifically designed for software development. These tools usually include:
- An editor designed to handle code (with, for example, syntax highlighting and auto-completion)
- Build, execution, and debugging tools
- Some form of source control
- Most IDEs support many different programming languages and contain many more features. They can, therefore, be large and take time to download and install. You may also need advanced knowledge to use them properly.
- Top Python IDEs For Data Science (My Recommendation):
Colaboratory (a WEB IDE):
- Welcome to Colaboratory!
Colaboratory is a free Jupyter notebook environment that requires no setup and runs entirely in the cloud. With Colaboratory you can write and execute code, save and share your analyses, and access powerful computing resources, all for free from your browser.
Jupyter and IPython (a WEB IDE):
The Jupyter Notebook is an open-source web application that allows you to create and share documents that contain live code, equations, visualizations, and narrative text. Uses include data cleaning and transformation, numerical simulation, statistical modeling, data visualization, machine learning, and much more. IPython, in turn, provides a rich architecture for interactive computing in multiple programming languages.
Jupyter Lab (a WEB IDE):
R NoteBook (a WEB IDE):
Markdown:
Markdown is a lightweight markup language that you can use to add formatting elements to plain-text documents. Created by John Gruber in 2004, Markdown is now one of the world’s most popular markup languages.
R Markdown:
Working with Data:
Git:
Git is a free and open source distributed version control system designed to handle everything from small to very large projects with speed and efficiency. Git is easy to learn and has a tiny footprint with lightning fast performance.
Git Resources:
Videos:
Git Cheat Sheets:
Slides:
GitHub:
Docker:
Docker provides a simple and powerful developer experience, workflows and collaboration for creating applications.
Programming Languages:
Python:
You can learn Python via SoloLearn, a great website for getting started with coding. It offers easy-to-follow lessons, interspersed with quizzes to help you retain what you are learning. We also recommend the following references:
Useful Tricks in Python:
Useful Modules in Python:
- Argparse: The argparse module makes it easy to write user-friendly command-line interfaces.
- Warning Control: Warning messages are typically issued in situations where it is useful to alert the user of some condition in a program, where that condition (normally) doesn’t warrant raising an exception and terminating the program.
- PDB Module: This module defines an interactive source code debugger for Python programs.
- Pickle Module: This module implements binary protocols for serializing and de-serializing a Python object structure. “Pickling” is the process whereby a Python object hierarchy is converted into a byte stream, and “unpickling” is the inverse operation, whereby a byte stream (from a binary file or bytes-like object) is converted back into an object hierarchy.
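As a rough illustration of three of the modules listed above, the sketch below parses a few command-line arguments with argparse, issues a warning instead of raising an exception, and round-trips the result through pickle. The option names (`values`, `--factor`) and the `scale` helper are hypothetical and chosen only for this example. The pdb module is left out because it is interactive; in practice, a call to the built-in `breakpoint()` drops you into the debugger.

```python
import argparse
import pickle
import warnings


def build_parser():
    """Build a small command-line interface (the options are hypothetical)."""
    parser = argparse.ArgumentParser(description="Scale a list of numbers.")
    parser.add_argument("values", nargs="+", type=float, help="numbers to scale")
    parser.add_argument("--factor", type=float, default=1.0, help="scaling factor")
    return parser


def scale(values, factor):
    """Scale the values; warn, rather than fail, when the factor is zero."""
    if factor == 0:
        warnings.warn("factor is 0, so every result will be 0", UserWarning)
    return [v * factor for v in values]


# Parse an explicit argument list so the sketch runs without a real command line.
args = build_parser().parse_args(["1", "2", "3", "--factor", "2"])
scaled = scale(args.values, args.factor)

# "Pickle" the list into a byte stream, then "unpickle" it back into a list.
restored = pickle.loads(pickle.dumps(scaled))
print(scaled, restored)
```

Note that pickle should only be used to load data you trust, since unpickling can execute arbitrary code.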
R:
Useful R Sites:
Useful R Tricks:
Machine Learning in R:
Useful Machine Learning Sites in R:
Practice Code:
If you want to solve interesting problems to practice Python or R, we recommend working through the following problems:
SQL:
SQL is a domain-specific language for managing data in databases.
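To give a feel for what SQL statements look like, here is a minimal sketch that runs a few of them from Python using the standard-library sqlite3 module and an in-memory database. The `students` table and its rows are made up for illustration.

```python
import sqlite3

# An in-memory database, so nothing is written to disk.
conn = sqlite3.connect(":memory:")

# Define a table and insert a few made-up rows.
conn.execute("CREATE TABLE students (name TEXT, course TEXT, grade REAL)")
conn.executemany(
    "INSERT INTO students VALUES (?, ?, ?)",
    [
        ("Sara", "Statistics", 18.5),
        ("Ali", "Statistics", 16.0),
        ("Mina", "Algebra", 17.0),
    ],
)

# Query: average grade per course, sorted by course name.
rows = conn.execute(
    "SELECT course, AVG(grade) FROM students GROUP BY course ORDER BY course"
).fetchall()
conn.close()

print(rows)
```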
Python Libraries for Data Science:
Python continues to take leading positions in solving data science tasks and challenges. KDnuggets introduced 20 Python libraries for data science. The following table was adapted from Applied Machine Learning and Deep Learning, created by Cuixian Chen. Here are five of the most important libraries:
NumPy:
NumPy is the fundamental package for scientific computing with Python. Besides its obvious scientific uses, NumPy can also be used as an efficient multi-dimensional container of generic data. Arbitrary data-types can be defined. This allows NumPy to seamlessly and speedily integrate with a wide variety of databases.
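The remark that arbitrary data types can be defined may be easier to see with a small sketch. The record type below (name, age, weight) is hypothetical; the second half shows an ordinary multi-dimensional array.

```python
import numpy as np

# A hypothetical record type: a short string, a 32-bit int, a 64-bit float.
person = np.dtype([("name", "U10"), ("age", "i4"), ("weight", "f8")])
people = np.array([("Sara", 31, 58.5), ("Ali", 27, 72.0)], dtype=person)

# Fields behave like ordinary arrays.
mean_age = people["age"].mean()

# A plain 2 x 3 array and a reduction along its first axis.
grid = np.arange(6).reshape(2, 3)
column_sums = grid.sum(axis=0)

print(mean_age, column_sums)
```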
Pandas:
Pandas is an open source, BSD-licensed library providing high-performance, easy-to-use data structures and data analysis tools for the Python programming language.
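A minimal sketch of the kind of data structure and analysis Pandas provides, using a small made-up table of sales figures:

```python
import pandas as pd

# A small, made-up table: one row per (city, year) pair.
df = pd.DataFrame(
    {
        "city": ["Tehran", "Tehran", "Shiraz", "Shiraz"],
        "year": [2019, 2020, 2019, 2020],
        "sales": [10.0, 14.0, 6.0, 9.0],
    }
)

# Average sales per city (a Series indexed by city).
mean_sales = df.groupby("city")["sales"].mean()

# Boolean filtering: keep only the 2020 rows.
recent = df[df["year"] == 2020]

print(mean_sales)
print(recent)
```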
Matplotlib:
Matplotlib is a Python 2D plotting library which produces publication quality figures in a variety of hardcopy formats and interactive environments across platforms. Matplotlib can be used in Python scripts, the Python and IPython shells, the Jupyter notebook, web application servers, and four graphical user interface toolkits.
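As a small sketch of using Matplotlib from a plain Python script, the example below draws a line plot and renders it as a PNG image in memory. The non-interactive "Agg" backend is selected here as an assumption so that the code also runs on machines without a display; in a Jupyter notebook you would normally skip that line.

```python
import io

import matplotlib

matplotlib.use("Agg")  # non-interactive backend; assumed so no display is needed
import matplotlib.pyplot as plt

xs = [0, 1, 2, 3, 4]
ys = [x ** 2 for x in xs]

fig, ax = plt.subplots()
ax.plot(xs, ys, marker="o", label="y = x^2")
ax.set_xlabel("x")
ax.set_ylabel("y")
ax.legend()

# Render the figure as PNG bytes instead of opening a window.
buffer = io.BytesIO()
fig.savefig(buffer, format="png")
plt.close(fig)
png_bytes = buffer.getvalue()
```

Passing a filename such as "plot.png" (a hypothetical path) to `fig.savefig` would write the figure to disk instead.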
Scikit-Learn:
Scikit-Learn provides simple and efficient tools for data mining and data analysis. It is built on NumPy, SciPy, and Matplotlib.
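A typical Scikit-Learn workflow is to split the data, fit a model, and score it. The sketch below does this on the bundled iris dataset; the choice of logistic regression and of a 25% test split is arbitrary and made only for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Load a small built-in dataset: 150 flowers, 4 features, 3 classes.
X, y = load_iris(return_X_y=True)

# Hold out a quarter of the data for evaluation.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Fit a classifier and measure its accuracy on the held-out data.
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)
accuracy = model.score(X_test, y_test)

print(f"test accuracy: {accuracy:.2f}")
```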
SciPy:
SciPy (pronounced “Sigh Pie”) is open-source software for mathematics, science, and engineering. It includes modules for statistics, optimization, integration, linear algebra, Fourier transforms, signal and image processing, ODE solvers, and more.
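Two of the modules mentioned above, integration and optimization, can be sketched in a few lines. The integrand and the objective function are simple examples picked because their exact answers are known.

```python
import math

from scipy import integrate, optimize

# Numerically integrate sin(x) from 0 to pi; the exact value is 2.
area, abs_error = integrate.quad(math.sin, 0.0, math.pi)

# Minimize (x - 3)^2 + 1; the exact minimum is 1, attained at x = 3.
result = optimize.minimize_scalar(lambda x: (x - 3.0) ** 2 + 1.0)

print(area, result.x, result.fun)
```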
Probabilistic Programming in Python:
PyMC3 allows you to write down models using an intuitive syntax to describe a data
generating process.
A Fascinating Guide For Machine Learning: