- Large Scale Machine Learning with Python
- Bastiaan Sjardin Luca Massaron Alberto Boschetti
Python packages
The packages that we are going to introduce in this section will be frequently used in the book. If you are not using a scientific distribution, we offer you a walkthrough on which versions you should choose and how to install them quickly and successfully.
NumPy
NumPy, which is Travis Oliphant's creation, is at the core of every analytical solution in the Python language. It provides the user with multidimensional arrays, along with a large set of functions to perform mathematical operations on these arrays. Arrays are blocks of data arranged along multiple dimensions, implementing mathematical vectors and matrices. Arrays are useful not just for storing data, but also for fast matrix operations (vectorization), which are indispensable when you wish to solve ad hoc data science problems.
- Website: http://www.numpy.org/
- Version at the time of writing: 1.11.1
- Suggested install command:
$ pip install numpy
Tip
As a convention that is largely adopted by the Python community, when importing NumPy, it is suggested that you alias it as np:
import numpy as np
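To give a flavor of the vectorization mentioned above, here is a minimal sketch: operations are applied to whole arrays at once, without explicit Python loops (the values are arbitrary):

```python
import numpy as np

# Build a 2x3 array (a matrix) from nested lists
A = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0]])

# Vectorized operations: no explicit Python loop is needed
doubled = A * 2            # element-wise multiplication
col_sums = A.sum(axis=0)   # sum along each column

# Matrix multiplication with the transpose (2x3 times 3x2 -> 2x2)
product = A.dot(A.T)
```

Because these operations run in compiled code rather than in the Python interpreter, they are typically orders of magnitude faster than equivalent loops over Python lists.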
SciPy
An original project by Travis Oliphant, Pearu Peterson, and Eric Jones, SciPy completes NumPy's functionalities, offering a larger variety of scientific algorithms for linear algebra, sparse matrices, signal and image processing, optimization, fast Fourier transformation, and much more.
- Website: http://www.scipy.org/
- Version at the time of writing: 0.17.1
- Suggested install command:
$ pip install scipy
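As a minimal sketch of what SciPy adds on top of NumPy, the following example builds a sparse matrix (storing only non-zero entries) and solves a small linear system; the numbers are arbitrary:

```python
import numpy as np
from scipy import sparse, linalg

# A sparse matrix stores only the non-zero entries
S = sparse.csr_matrix([[0, 0, 3],
                       [4, 0, 0],
                       [0, 5, 0]])
nonzeros = S.nnz  # number of stored non-zero values

# Solve the dense linear system Ax = b with scipy.linalg
A = np.array([[3.0, 1.0],
              [1.0, 2.0]])
b = np.array([9.0, 8.0])
x = linalg.solve(A, b)  # x such that A.dot(x) == b
```

Sparse matrices, in particular, will matter for large scale work: they let you hold datasets in memory that would be far too big as dense arrays.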
Pandas
Pandas deals with everything that NumPy and SciPy cannot do. In particular, thanks to its specific object data structures, DataFrame and Series, it allows the handling of complex tables of data of different types (something that NumPy's arrays cannot do) and time series. Thanks to Wes McKinney's creation, you will be able to easily and smoothly load data from a variety of sources, and then slice, dice, handle missing elements, add, rename, aggregate, reshape, and finally visualize it at your will.
- Website: http://pandas.pydata.org/
- Version at the time of writing: 0.18.0
- Suggested install command:
$ pip install pandas
Tip
Conventionally, pandas is imported as pd:
import pandas as pd
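As a minimal sketch of the handling of mixed-type tables and missing elements described above, here is a toy DataFrame (the column names and values are arbitrary):

```python
import numpy as np
import pandas as pd

# A small table mixing strings and numbers, with a missing value (np.nan)
df = pd.DataFrame({'city': ['Rome', 'Rome', 'Milan'],
                   'sales': [10.0, np.nan, 25.0]})

# Handle the missing element, then aggregate by group
df['sales'] = df['sales'].fillna(0.0)
totals = df.groupby('city')['sales'].sum()
```

The same fillna/groupby pattern scales to tables with millions of rows, which is where pandas starts paying off over hand-written loops.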
Scikit-learn
Started as part of SciKits (SciPy Toolkits), Scikit-learn is the core of data science operations in Python. It offers all that you may need in terms of data preprocessing, supervised and unsupervised learning, model selection, validation, and error metrics. Expect us to talk at length about this package throughout the book.
Scikit-learn started in 2007 as a Google Summer of Code project by David Cournapeau. Since 2013, it has been taken over by the researchers at Inria (French Institute for Research in Computer Science and Automation).
Scikit-learn offers modules for data processing (sklearn.preprocessing and sklearn.feature_extraction), model selection and validation (sklearn.cross_validation, sklearn.grid_search, and sklearn.metrics), and a complete set of methods (sklearn.linear_model) in which the target value, being a number or probability, is expected to be a linear combination of the input variables.
- Website: http://scikit-learn.org/stable/
- Version at the time of writing: 0.17.1
- Suggested install command:
$ pip install scikit-learn
Tip
Note that the imported module is named sklearn.
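As a minimal sketch of the Scikit-learn workflow (instantiate an estimator, fit it on training data, predict on new data), here is a linear model from sklearn.linear_model on a toy dataset where the target is exactly twice the input:

```python
from sklearn.linear_model import LinearRegression

# Toy data: y = 2 * x, expressed as a list of single-feature samples
X = [[1.0], [2.0], [3.0], [4.0]]
y = [2.0, 4.0, 6.0, 8.0]

model = LinearRegression()
model.fit(X, y)                 # estimate the coefficients from the data
pred = model.predict([[5.0]])   # predict the target for an unseen sample
```

Every estimator in the package follows this same fit/predict interface, which is what makes it easy to swap models in and out of a pipeline.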
The matplotlib package
Originally developed by John Hunter, matplotlib is the library containing all the building blocks to create quality plots from arrays and visualize them interactively.
You can find all the MATLAB-like plotting functionality inside the pylab module.
- Website: http://matplotlib.org/
- Version at the time of writing: 1.5.1
- Suggested install command:
$ pip install matplotlib
You can simply import just what you need for your visualization purposes:
import matplotlib as mpl
from matplotlib import pyplot as plt
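As a minimal sketch (using the non-interactive Agg backend so that it also runs on a machine without a display; the data and filename are arbitrary), the following plots a few points and saves the figure to a file:

```python
import matplotlib
matplotlib.use('Agg')  # render to files only; no display is required
from matplotlib import pyplot as plt
import os
import tempfile

# Plot a simple line with markers and label the axes
plt.plot([1, 2, 3, 4], [1, 4, 9, 16], marker='o')
plt.xlabel('x')
plt.ylabel('x squared')

# Save the figure to a PNG file in the temporary directory
out_path = os.path.join(tempfile.gettempdir(), 'example_plot.png')
plt.savefig(out_path)
plt.close()
```

When working interactively (for example, in a Jupyter notebook), you would skip the Agg backend and call plt.show() instead of saving to a file.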
Gensim
Gensim, programmed by Radim Řehůřek, is an open source package suitable for analyzing large textual collections through parallel, distributable, online algorithms. Among its advanced functionalities, it implements Latent Semantic Analysis (LSA), topic modeling by Latent Dirichlet Allocation (LDA), and Google's word2vec, a powerful algorithm that transforms texts into vector features to be used in supervised and unsupervised machine learning.
- Website: http://radimrehurek.com/gensim/
- Version at the time of writing: 0.13.1
- Suggested install command:
$ pip install gensim
H2O
H2O is an open source framework for big data analysis created by the start-up H2O.ai (previously named 0xdata). It is usable from the R, Python, Scala, and Java programming languages. H2O can easily run on a standalone machine (leveraging multiprocessing) or on a Hadoop cluster (for example, a cluster in an AWS environment), thus helping you scale up and out.
- Website: http://www.h2o.ai
- Version at the time of writing: 3.8.3.3
In order to install the package, you first have to download and install Java on your system (you need to have the Java Development Kit (JDK) 1.8 installed, as H2O is Java-based); then you can refer to the online instructions provided at http://www.h2o.ai/download/h2o/python.
Let's go over all the installation steps together in the following lines.
You can install both H2O and its Python API, as we have been using in our book, with the following instructions:
$ pip install -U requests
$ pip install -U tabulate
$ pip install -U future
$ pip install -U six
These steps will install the required packages, and then we can install the framework, taking care to remove any previous installation:
$ pip uninstall h2o
$ pip install h2o
In order to install the same version as we have in our book, replace the last pip install command with the following:
$ pip install http://h2o-release.s3.amazonaws.com/h2o/rel-turin/3/Python/h2o-3.8.3.3-py2.py3-none-any.whl
If you run into problems, please visit the H2O Google groups page, where you can get help with your problems:
https://groups.google.com/forum/#!forum/h2ostream
XGBoost
XGBoost is a scalable, portable, and distributed gradient boosting library (a tree ensemble machine learning algorithm). It is available for Python, R, Java, Scala, Julia, and C++ and it can work on a single machine (leveraging multithreading), both in Hadoop and Spark clusters.
- Website: https://xgboost.readthedocs.io/en/latest/
- Version at the time of writing: 0.4
Detailed instructions to install XGBoost on your system can be found at https://github.com/dmlc/xgboost/blob/master/doc/build.md.
The installation of XGBoost on both Linux and Mac OS is quite straightforward, whereas it is a little bit trickier for Windows users. For this reason, we provide specific installation steps to get XGBoost working on Windows:
- First of all, download and install Git for Windows (https://git-for-windows.github.io/).
- Then you need a Minimalist GNU for Windows (MinGW) compiler present on your system. You can download it from http://www.mingw.org/ according to the characteristics of your system.
- From the command line, execute the following:
$ git clone --recursive https://github.com/dmlc/xgboost
$ cd xgboost
$ git submodule init
$ git submodule update
- Then, from the command line, copy the configuration for 64-bit systems to be the default one:
$ copy make\mingw64.mk config.mk
Alternatively, you can copy the plain 32-bit version:
$ copy make\mingw.mk config.mk
- After copying the configuration file, you can run the compiler, setting it to use four threads in order to speed up the compiling procedure:
$ make -j4
- Finally, if the compiler completed its work without errors, you can install the package in your Python by executing the following commands:
$ cd python-package
$ python setup.py install
Theano
Theano is a Python library that allows you to define, optimize, and evaluate mathematical expressions involving multidimensional arrays efficiently. Basically, it provides you with all the building blocks that you need to create deep neural networks.
- Website: http://deeplearning.net/software/theano/
- Release at the time of writing: 0.8.2
The installation of Theano should be straightforward as it is now a package on PyPI:
$ pip install Theano
If you want the most updated version of the package, you can get it by cloning the GitHub repository:
$ git clone git://github.com/Theano/Theano.git
Then you can proceed with the direct Python installation:
$ cd Theano
$ python setup.py install
To test your installation, you can run the following from the shell/CMD and verify the reports:
$ pip install nose
$ pip install nose-parameterized
$ nosetests theano
If you are working on a Windows OS and the previous instructions don't work, you can try these steps:
- Install TDM-GCC x64 (http://tdm-gcc.tdragon.net/).
- Open the Anaconda command prompt and execute the following:
$ conda update conda
$ conda update --all
$ conda install mingw libpython
$ pip install git+git://github.com/Theano/Theano.git
Tip
Theano needs libpython, which isn't compatible with Python 3.5 yet, so if your Windows installation is not working, that could be the likely cause.
In addition, Theano's website provides some information to Windows users that could support you when everything else fails:
http://deeplearning.net/software/theano/install_windows.html
An important requirement for Theano to scale out on GPUs is to install the NVIDIA CUDA drivers and SDK for code generation and execution on the GPU. If you do not know much about the CUDA Toolkit, you can start from this web page in order to understand more about the technology being used:
https://developer.nvidia.com/cuda-toolkit
Therefore, if your computer has an NVIDIA GPU, you can find all the necessary instructions for installing CUDA in this tutorial page from NVIDIA itself:
http://docs.nvidia.com/cuda/cuda-quick-start-guide/index.html#axzz4A8augxYy
TensorFlow
Just like Theano, TensorFlow is an open source software library for numerical computation, using data flow graphs instead of just arrays. Nodes in such a graph represent mathematical operations, whereas the graph edges represent the multidimensional data arrays (the so-called tensors) moved between the nodes. TensorFlow was originally developed by researchers on the Google Brain Team, who recently made it open source for the public.
- Website: https://github.com/tensorflow/tensorflow
- Release at the time of writing: 0.8.0
For the installation of TensorFlow on your computer, follow the instructions found at the following link:
https://github.com/tensorflow/tensorflow/blob/master/tensorflow/g3doc/get_started/os_setup.md
Windows support is not present at the moment but it is in the current roadmap:
https://github.com/tensorflow/tensorflow/blob/master/tensorflow/g3doc/resources/roadmap.md
For Windows users, a good compromise could be to run the package on a Linux-based virtual machine or Docker machine. (The preceding OS set-up page offers directions to do so.)
The sknn library
The sknn library (short for scikit-neuralnetwork) is a wrapper for Pylearn2, helping you implement deep neural networks without requiring you to become an expert on Theano. As a bonus, the library is compatible with the Scikit-learn API.
- Website: https://scikit-neuralnetwork.readthedocs.io/en/latest/
- Release at the time of publication: 0.7
- To install the library, just use the following command:
$ pip install scikit-neuralnetwork
Optionally, if you want to take advantage of the most advanced features such as convolution, pooling, or upscaling, you have to complete the installation as follows:
$ pip install -r https://raw.githubusercontent.com/aigamedev/scikit-neuralnetwork/master/requirements.txt
After installation, you also have to execute the following:
$ git clone https://github.com/aigamedev/scikit-neuralnetwork.git
$ cd scikit-neuralnetwork
$ python setup.py develop
As seen for XGBoost, this will make the sknn package available in your Python installation.
Theanets
The theanets package is a deep learning and neural network toolkit written in Python and uses Theano to accelerate computations. Just as with sknn, it tries to make it easier to interface with Theano functionalities in order to create deep learning models.
- Website: https://github.com/lmjohns3/theanets
- Version at the time of writing: 0.7.3
- Suggested installation procedure:
$ pip install theanets
You can also download the current version from GitHub and install the package directly in Python:
$ git clone https://github.com/lmjohns3/theanets
$ cd theanets
$ python setup.py develop
Keras
Keras is a minimalist, highly modular neural networks library written in Python and capable of running on top of either TensorFlow or Theano.
- Website: http://keras.io/
- Version at the time of writing: 1.0.5
- Suggested installation from PyPI:
$ pip install keras
You can also install the latest available version (advisable as the package is in continuous development) using the following command:
$ pip install git+git://github.com/fchollet/keras.git
Other useful packages to install on your system
Concluding this long tour of the many packages that you will see in action throughout the pages of this book, we close with three simple yet quite useful packages that need little presentation but should be installed on your system: memory_profiler, climate, and NeuroLab.
memory_profiler is a package that monitors the memory usage of a process. It also helps dissect the memory consumption of a specific Python script, line by line. It can be installed as follows:
$ pip install -U memory_profiler
Climate just consists of some basic command-line utilities for Python. It can be promptly installed as follows:
$ pip install climate
Finally, NeuroLab is a very basic neural network package loosely based on the Neural Network Toolbox (NNT) in MATLAB. It is based on NumPy and SciPy, not Theano; consequently, do not expect astonishing performance, but know that it is a good learning toolbox. It can be easily installed as follows:
$ pip install neurolab