Python for computational statistics and data science

Python is a widely used, general-purpose programming language, and it is a good starting point for learning to program. It offers a simple syntax and a large collection of libraries and APIs that we can use to extend our programs.

To use Python on your computer, download and install it from https://www.python.org/downloads/ if your OS does not already have it installed. After completing the installation, you can start the Python interpreter from Terminal, or from the Command Prompt on the Windows platform, by typing the following command:

$ python

Tip

Do not type the $ sign; it only marks the shell prompt. Just type python in Terminal. This command applies to Python 2.x.

Once you have executed the command, you should see the Python command prompt, as shown in the following screenshot:

If you installed Python 3, you usually run the program using the following command:

$ python3

You should see the Python 3 shell in your Terminal.
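If you are unsure which interpreter a command has started, a quick check inside the shell can confirm it. The following is a minimal sketch using only the standard library; the variable names are illustrative.

```python
import sys

# sys.version_info holds the interpreter version as a named tuple.
major, minor = sys.version_info.major, sys.version_info.minor
print("Running Python {}.{}".format(major, minor))

# True when the interpreter is Python 3.x or later.
is_python3 = major >= 3
print("Python 3 interpreter: {}".format(is_python3))
```

You can type these lines one at a time at the >>> prompt, or save them to a file and run that file with the python or python3 command.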

What's next?

There are many resources to help you learn how to write programs in Python. I recommend reading the Python documentation at https://www.python.org/doc/. You can also read Python books to accelerate your learning. This book does not cover the basics of the Python programming language.

Python libraries for computational statistics and data science

Python has a large community whose members help each other learn and share. Several community members have released open source libraries for computational statistics and data science, and we will use these libraries in our implementations.

The following are several Python libraries for statistics and data science.

NumPy

NumPy is a fundamental package for efficient scientific computing in Python. It provides N-dimensional arrays and tools for integrating C/C++ and Fortran code, along with features for linear algebra, Fourier transforms, and random number generation.

The official website for NumPy can be found at http://www.numpy.org.
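To give a feel for these features, here is a small sketch that touches arrays, linear algebra, the Fourier transform, and random numbers. The data and variable names are made up for illustration, and np.random.default_rng assumes a reasonably recent NumPy release (1.17 or later).

```python
import numpy as np

# A 2-D array (matrix) and a 1-D array (vector).
a = np.array([[2.0, 1.0], [1.0, 3.0]])
b = np.array([3.0, 5.0])

# Linear algebra: solve the linear system a @ x = b.
x = np.linalg.solve(a, b)

# Fourier transform of a short constant signal.
spectrum = np.fft.fft(np.ones(4))

# Random numbers: a seeded generator gives reproducible draws.
rng = np.random.default_rng(seed=42)
samples = rng.normal(loc=0.0, scale=1.0, size=1000)

print(x)
print(spectrum)
print(samples.mean(), samples.std())
```

The solution x is approximately [0.8, 1.4], and the transform of a constant signal has all of its energy in the first (zero-frequency) component.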

Pandas

Pandas is a library for handling table-like data structures called DataFrame objects. It provides powerful and efficient numerical operations similar to those of NumPy's array object.

Further information about pandas can be found at http://pandas.pydata.org.
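As a brief illustration, the sketch below builds a small DataFrame, adds a computed column, and aggregates by group. The column names and values are invented for this example.

```python
import pandas as pd

# Build a small table; each dictionary key becomes a column.
df = pd.DataFrame({
    "group": ["a", "a", "b", "b"],
    "value": [1.0, 3.0, 2.0, 6.0],
})

# Column arithmetic is vectorized, much like NumPy arrays.
df["doubled"] = df["value"] * 2

# Aggregate the values within each group.
group_means = df.groupby("group")["value"].mean()

print(df)
print(group_means)
```

Here the mean of group a is 2.0 and the mean of group b is 4.0.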

SciPy

SciPy builds on the NumPy library. It contains functions for linear algebra, interpolation, integration, clustering, and so on.

The official website can be found at http://scipy.org/scipylib/index.html.
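Two of these areas, integration and interpolation, can be sketched in a few lines. The function being integrated and the sample points are arbitrary choices for illustration.

```python
import numpy as np
from scipy import integrate, interpolate

# Numerical integration of x**2 over [0, 1]; the exact value is 1/3.
# quad returns the estimate and an upper bound on the absolute error.
area, abs_error = integrate.quad(lambda x: x ** 2, 0.0, 1.0)

# Linear interpolation between known sample points.
xs = np.array([0.0, 1.0, 2.0])
ys = np.array([0.0, 10.0, 20.0])
f = interpolate.interp1d(xs, ys)
y_mid = float(f(0.5))

print(area, abs_error)
print(y_mid)
```

Because the sample points lie on a straight line, the interpolated value at 0.5 is 5.0.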

Scikit-learn

Scikit-learn is one of the most popular machine learning libraries for Python. It provides tools for data preprocessing, classification, regression, clustering, dimensionality reduction, and model selection.

Further information about Scikit-learn can be found at http://scikit-learn.org/stable/.
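The following sketch combines three of these pieces: preprocessing, classification, and a train/test split from the model selection module. The choice of the bundled iris dataset, logistic regression, and the split parameters are assumptions made for this example, not recommendations.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Load a small dataset that ships with scikit-learn.
X, y = load_iris(return_X_y=True)

# Hold out a quarter of the samples for evaluation.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Chain feature scaling and a classifier into one estimator.
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model.fit(X_train, y_train)

# Fraction of held-out samples classified correctly.
accuracy = model.score(X_test, y_test)
print(f"Test accuracy: {accuracy:.2f}")
```

Wrapping the scaler and classifier in a pipeline means the scaling parameters are learned from the training data only, which avoids leaking information from the test set.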

Shogun

Shogun is a machine learning library with a Python interface that focuses on large-scale kernel methods such as support vector machines (SVMs). It comes with a range of different SVM implementations.

The official website can be found at http://www.shogun-toolbox.org.

SymPy

SymPy is a Python library for symbolic mathematical computations. It has capabilities in calculus, algebra, geometry, discrete mathematics, quantum physics, and more.

The official website can be found at http://www.sympy.org. An online interface to SymPy, called SymPy Gamma, is available at http://www.sympygamma.com.
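A short sketch of symbolic calculus and algebra is shown below. The expression is an arbitrary example chosen for illustration.

```python
import sympy as sp

# Declare a symbolic variable and build an expression from it.
x = sp.symbols("x")
expr = x ** 2 + 3 * x

# Calculus: symbolic derivative and antiderivative.
derivative = sp.diff(expr, x)       # 2*x + 3
integral = sp.integrate(expr, x)    # x**3/3 + 3*x**2/2

# Algebra: solve expr = 0 for x (the roots are -3 and 0).
roots = sp.solve(sp.Eq(expr, 0), x)

print(derivative)
print(integral)
print(roots)
```

Unlike NumPy, which works with floating-point numbers, SymPy keeps the results as exact symbolic expressions.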

Statsmodels

Statsmodels is a Python module we can use to explore data, estimate statistical models, and perform statistical tests.

You can find out more about Statsmodels by visiting the official website at http://statsmodels.sourceforge.net.