This practical session uses Scikit-learn (and optionally TensorFlow for the neural network part). It is inspired by practical sessions given in French at Aix-Marseille University.
Scikit-learn is a free software machine learning library for the Python programming language. It is designed to interoperate with the Python numerical and scientific libraries NumPy and SciPy. A nice textbook on Python for Data Science is available on GitHub. TensorFlow is an open-source software library for numerical computation using data flow graphs. It also uses the Python programming language.
To install Scikit-learn and TensorFlow on your own computer, I suggest starting by installing Anaconda, which includes NumPy, SciPy and Scikit-learn. To visualize some models (in particular decision trees) in sklearn, you will also need to install Graphviz. You can download TensorFlow from this page.
You do not need a strong background in Python to do this practical session. However, if you want to learn the basics, you can follow this tutorial or this one.
Note that if you do not want to use the standard Python interpreter from your terminal, you can use ptpython (install it with pip install ptpython), which provides better completion, colors, editing help, etc. Another interesting alternative is the Jupyter Notebook, a web application that allows you to create and share documents that contain live code, equations, visualizations and explanatory text. Jupyter is already installed in Anaconda; just type jupyter notebook in your dev directory to use it.
To be graded (part of the practical grade), you are expected to upload on Claroline Connect (course "Machine Learning - Fundamentals and algorithms (DSC+MLDM) / Pattern Recognition (CIMET+3DMT)"), no more than two days after the session, an archive (zip, tar) containing a report (4-5 pages) that records, for each exercise, your results for the numbered lines. It is more important to explain what the commands do than to report the raw results (in particular, do not paste very long answers, such as entire datasets). The archive should also contain the Python programs explicitly asked for in the text of the session.
The original format and the description of most of the datasets used during this lab session can be found on the UCI web page. We will use, in particular, the ones provided in the Scikit-learn dataset repository: iris, boston, diabetes, digits, linnerud, sample images, 20newsgroups. The datasets use a common set of attributes (they are not always all defined): data, target, target_names, feature_names, DESCR.
The raw data files (.csv) are available in "~/anaconda/pkgs/scikit-learn-0.18.1-np111py36_1/lib/python3.6/site-packages/sklearn/datasets/" (it would be in an equivalent location on your computer). If you are curious about how the dataset attributes (data, target, target_names, feature_names, DESCR) mentioned before are loaded, you can look at the file "base.py" in this directory.
The Iris dataset contains 3 classes (Iris-Setosa, Iris-Versicolour, Iris-Virginica) of 50 instances each, where each class refers to a type of iris plant. One class is linearly separable from the other two; the latter are NOT linearly separable from each other. The attributes are the sepal_length, sepal_width, petal_length and petal_width (all in cm).
Start python (or ptpython) in a terminal. The following commands help you to load the Iris dataset.
from sklearn.datasets import load_iris
irisData = load_iris()  # the data are loaded

To print the data attributes, you can type:
print(irisData.data)
print(irisData.target)
print(irisData.target_names)
print(irisData.feature_names)
print(irisData.DESCR)

Execute the following commands and understand what they do (write down the answer in your report for the lines that are numbered; you can, of course, copy and paste the commands in your terminal):
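As an illustration of this kind of dataset exploration, here is a minimal sketch (these specific commands are my own examples, not the official numbered list):

from sklearn.datasets import load_iris

irisData = load_iris()
print(irisData.data.shape)    # (150, 4): 150 examples described by 4 features
print(irisData.target.shape)  # (150,): one integer label per example
print(irisData.data[0])       # feature vector of the first example
print(irisData.target_names[irisData.target[0]])  # its class name
print((irisData.target == 0).sum())  # number of examples of class 0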
matplotlib, and in particular its module matplotlib.pyplot, can be used to visualize your data. "Pyplot provides the state-machine interface to the underlying plotting library in matplotlib. This means that figures and axes are implicitly and automatically created to achieve the desired plot. For example, calling plot from pyplot will automatically create the necessary figure and axes to achieve the desired plot. Setting a title will then automatically set that title to the current axes object".
from matplotlib import pyplot as plt  # import pyplot under the shorter name "plt"
X = irisData.data
Y = irisData.target
x = 0
y = 1
Copy the following program into a file PS1prog1.py:

# -*- coding: utf-8 -*-
from matplotlib import pyplot as plt
from sklearn.datasets import load_iris

irisData = load_iris()
X = irisData.data
Y = irisData.target
x = 0
y = 1
colors = ["red", "green", "blue"]
for i in range(3):
    plt.scatter(X[Y == i][:, x], X[Y == i][:, y], color=colors[i], label=irisData.target_names[i])
plt.legend()
plt.xlabel(irisData.feature_names[x])
plt.ylabel(irisData.feature_names[y])
plt.title("Iris Data - size of the sepals only")
plt.show()

In a terminal, launch the program by typing
python PS1prog1.py
You can generate a synthetic dataset with datasets.make_classification(n_samples=25, n_features=4, n_informative=2, n_redundant=2, n_classes=2) (the command has many more possible parameters). The command returns an array X of shape [n_samples, n_features] which contains the generated samples and an array Y of shape [n_samples] which contains the integer labels for the class membership of each sample.
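A minimal sketch of this command in action (the variable names X and Y follow the description above):

from sklearn import datasets

# generate 25 examples with 4 features each, spread over 2 classes
X, Y = datasets.make_classification(n_samples=25, n_features=4,
                                    n_informative=2, n_redundant=2,
                                    n_classes=2)
print(X.shape)  # (25, 4)
print(Y)        # 25 integer labels in {0, 1}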
The dataset that you have created will be used in the last exercise. You can load any text file using NumPy and the command numpy.loadtxt described on this page, or you can load a .csv file using:

from numpy import genfromtxt
my_data = genfromtxt('my_file.csv', delimiter=',')
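One way to keep the generated dataset for the last exercise is to save it to a .csv file first; a minimal sketch, assuming the (hypothetical) file name my_dataset.csv:

import numpy as np

# store features and labels side by side, one example per row
np.savetxt('my_dataset.csv', np.column_stack((X, Y)), delimiter=',')

# reload later: the last column holds the labels
data = np.genfromtxt('my_dataset.csv', delimiter=',')
X2, Y2 = data[:, :-1], data[:, -1].astype(int)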
The k-nearest neighbors classifiers are available in sklearn in the neighbors package.
In the following, clf = neighbors.KNeighborsClassifier(n_neighbors) creates a "KNN classifier" object, clf.fit(X, Y) uses the data to fit the classifier, clf.predict can be used to classify new examples, whereas clf.predict_proba estimates the class membership probabilities of a given example. clf.score computes the global score of the classifier on a given dataset.
from sklearn import neighbors
nb_neighb = 15
help(neighbors.KNeighborsClassifier)
clf = neighbors.KNeighborsClassifier(nb_neighb)
help(clf.fit)
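To make the commands above concrete, here is a minimal sketch on the Iris data (the value 15 simply reuses nb_neighb from above):

from sklearn import neighbors
from sklearn.datasets import load_iris

irisData = load_iris()
X = irisData.data
Y = irisData.target

clf = neighbors.KNeighborsClassifier(15)
clf.fit(X, Y)                    # fit the classifier on the data
print(clf.predict(X[:5]))        # classes predicted for the first 5 examples
print(clf.predict_proba(X[:5]))  # estimated class probabilities
print(clf.score(X, Y))           # global score (accuracy) on this dataset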
Scikit-learn provides a function to randomly split a dataset into a training set and a test set (model_selection.train_test_split). The metrics module provides a good number of evaluation criteria for your classifier.
from sklearn.model_selection import train_test_split
import random  # to generate random numbers

Execute the commands and understand (and write in your report) what they do:
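As an illustration of what train_test_split and the metrics module provide (a sketch of my own, not the official numbered commands; the 30% test fraction matches the exercises later in this session):

from sklearn import metrics, neighbors
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

irisData = load_iris()
X = irisData.data
Y = irisData.target

# keep 30% of the examples aside as a test set
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.3)

clf = neighbors.KNeighborsClassifier(15)
clf.fit(X_train, Y_train)
Y_pred = clf.predict(X_test)
print(metrics.accuracy_score(Y_test, Y_pred))    # fraction of correct predictions
print(metrics.confusion_matrix(Y_test, Y_pred))  # errors broken down by class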
Cross-validation can be done with the class KFold, which splits the original dataset X into n folds, i.e. pairs of (training set, test set).
from sklearn.model_selection import KFold
kf = KFold(n_splits=10, shuffle=True)
for learn, test in kf.split(X):
    print("app : ", learn, " test ", test)
Replace shuffle=True by shuffle=False. What happened?

The following program uses cross-validation to select the best number of neighbors k:

# -*- coding: utf-8 -*-
import random
from sklearn.datasets import load_iris
from sklearn import neighbors
from sklearn.model_selection import KFold

irisData = load_iris()
X = irisData.data
Y = irisData.target

kf = KFold(n_splits=10, shuffle=True)
scores = []
for k in range(1, 30):
    score = 0
    clf = neighbors.KNeighborsClassifier(k)
    for learn, test in kf.split(X):
        X_train = X[learn]
        Y_train = Y[learn]
        clf.fit(X_train, Y_train)
        X_test = X[test]
        Y_test = Y[test]
        score = score + clf.score(X_test, Y_test)
    scores.append(score)
print(scores)
print("best k:", scores.index(max(scores)) + 1)
What happens if you replace kf=KFold(n_splits=10,shuffle=True) by kf=KFold(n_splits=3,shuffle=False)?

Decision trees are available in sklearn in the tree package.
Execute the commands and understand (and write in your report) what they do:
from sklearn.datasets import load_iris
iris = load_iris()
from sklearn import tree
If you type help(tree.DecisionTreeClassifier), you will see all the parameters of this algorithm, among which:

max_depth: (default=None) The maximum depth of the tree. If None, then nodes are expanded until all leaves are pure or until all leaves contain less than min_samples_split samples.

min_samples_split: (default=2) The minimum number of samples required to split an internal node.

max_leaf_nodes: (default=None) Grow a tree with max_leaf_nodes in best-first fashion. Best nodes are defined as relative reduction in impurity. If None, the algorithm allows an unlimited number of leaf nodes.

For example:

clf = tree.DecisionTreeClassifier(criterion="entropy", max_depth=3, max_leaf_nodes=5)
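The next paragraph assumes the tree has been written to a file tree.dot; here is a minimal sketch of how to produce it with tree.export_graphviz (fitting on the full Iris data is my own choice here):

from sklearn.datasets import load_iris
from sklearn import tree

iris = load_iris()
clf = tree.DecisionTreeClassifier(criterion="entropy", max_depth=3, max_leaf_nodes=5)
clf.fit(iris.data, iris.target)

# write the tree in Graphviz .dot format
with open("tree.dot", "w") as f:
    tree.export_graphviz(clf, out_file=f,
                         feature_names=iris.feature_names,
                         class_names=iris.target_names)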
Note that, on Linux, you can transform the .dot file into a .pdf using the command dot -Tpdf tree.dot -o iris.pdf. On Windows, you can start the graphviz application and open the .dot file from there.
Write a program TP1prog3.py which opens the IRIS dataset and trains a decision tree using the default parameters. Visualize this tree. How many leaves does it contain? Start the training phase again by gradually decreasing the number of leaves from 9 to 3 using the command clf=tree.DecisionTreeClassifier(max_leaf_nodes=xx) and observe (and describe) the resulting trees.

Write a program TP1prog4.py which opens the IRIS dataset and trains a decision tree using the Gini criterion (in gini-iris.dot) and a second one using the entropy (in entropy-iris.dot). Compare the two trees.

Write a program TP1prog5.py which creates a dataset using the command X,Y=make_classification(n_samples=100000,n_features=20,n_informative=15,n_classes=3), then splits the generated data into a learning set and a test set (30% for test). Learn a decision tree on the learning set using the command clf=tree.DecisionTreeClassifier(max_leaf_nodes=500*i) (where "i" varies from 1 to 20), then print the score of the classifier on the learning AND on the test set (note that print("%6.4f" %x) prints a floating point number "x", at least six characters wide, with four characters after the decimal point). What do you notice? What is the name of the observed phenomenon? A skeleton sketch is given after this list of exercises.

Write a program TP1prog6.py which does the same as TP1prog5.py but with trees of different depths, using the command clf=tree.DecisionTreeClassifier(max_depth=i) (for i from 1 to 40).
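A minimal skeleton for TP1prog5.py, following the split and loop described above (a sketch to complete, not the official solution):

# -*- coding: utf-8 -*-
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn import tree

X, Y = make_classification(n_samples=100000, n_features=20,
                           n_informative=15, n_classes=3)
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.3)

for i in range(1, 21):
    clf = tree.DecisionTreeClassifier(max_leaf_nodes=500 * i)
    clf.fit(X_train, Y_train)
    # compare the score on the learning set with the score on unseen data
    print("%6.4f" % clf.score(X_train, Y_train),
          "%6.4f" % clf.score(X_test, Y_test))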
The following commands let you explore the DIGITS dataset:

from sklearn.datasets import load_digits
digits = load_digits()
digits.data[0]
digits.images[0]
digits.data[0].reshape(8,8)
digits.target[0]

# you can see the pictures using this piece of code:
from matplotlib import pyplot as plt
plt.gray()
plt.matshow(digits.images[0])
plt.show()

# to count the number of examples of a particular class, you can use:
Y = digits.target
print(len(Y[Y==0]))

Although it is much less efficient than TensorFlow for deep neural networks, sklearn also includes a class MLPClassifier, which implements a multi-layer perceptron (MLP) algorithm trained using backpropagation.
from sklearn.neural_network import MLPClassifier
X = digits.data

Execute the following commands and understand (and write in your report) what they do:
clf = MLPClassifier(solver='lbfgs', alpha=1e-5, hidden_layer_sizes=(5, 2), random_state=1)
clf.fit(X, Y)
Write a program TP1prog7.py which opens the DIGITS dataset and trains an MLP using the default parameters and a small number of hidden layers. Split the data into a learning set and a test set (30% for test) and report the score obtained by your classifier on the test set. Change at least 3 parameters of your classifier (e.g. number of neurons, number of layers, learning rate) and report the score you obtain on the test set for each version. What can you conclude? A skeleton sketch is given below.
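A minimal skeleton for TP1prog7.py, assuming a 30% test split as stated above (the hidden_layer_sizes value is an arbitrary starting point, meant to be varied):

# -*- coding: utf-8 -*-
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

digits = load_digits()
X_train, X_test, Y_train, Y_test = train_test_split(
    digits.data, digits.target, test_size=0.3)

# a small network to start with; change the parameters and compare test scores
clf = MLPClassifier(hidden_layer_sizes=(50,))
clf.fit(X_train, Y_train)
print("%6.4f" % clf.score(X_test, Y_test))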