This practical session uses Scikit-learn and Keras (with the TensorFlow backend) for the last exercise on "deep learning". The first 4 exercises are inspired by practical sessions given in French at Aix-Marseille University. Scikit-learn and Keras are free machine learning libraries for the Python programming language. They are designed to interoperate with the Python numerical and scientific libraries NumPy and SciPy. A nice textbook on Python for Data Science is available on GitHub.
To install Scikit-learn and Keras (+ TensorFlow) on your own computer, I suggest starting by installing Anaconda, which includes NumPy, SciPy and Scikit-learn. To visualize some models (in particular decision trees) in sklearn, you will also need to install Graphviz, which can also be done using Anaconda with this instruction. Installing Keras is also easy with Anaconda with this instruction. To install TensorFlow, follow this link.
You do not need a strong background in Python to do these practical sessions. However, if you want to learn the basics, you can follow this tutorial or this one.
Note that, to avoid using the standard Python interpreter from your terminal, you can use one of the many IDEs for Data Science. I would recommend the "Jupyter Notebook". It is a web application that allows you to create and share documents that contain live code, equations, visualizations and explanatory text. Jupyter is already installed in Anaconda; just type jupyter notebook in your dev directory to use it.
The original format and the description of most of the datasets that are used during this lab session can be found on the UCI web page. We will use, in particular, the ones provided in the Scikit-learn dataset repository: iris, boston, diabetes, digits, linnerud, sample images, 20newsgroups. The datasets use a common set of attributes (they are not always all defined): data, target, target_names, feature_names, DESCR.
The corresponding data files (.csv) are available in "~/anaconda3/pkgs/scikit-learn-0.19.0-py36h4cafacf_2/lib/python3.6/site-packages/sklearn/datasets/" (it will be in an equivalent location on your computer). If you are curious about how the dataset attributes (data, target, target_names, feature_names, DESCR) mentioned before are loaded, you can look at the file "base.py" in this directory.
The Iris dataset contains 3 classes (Iris-Setosa, Iris-Versicolour, Iris-Virginica) of 50 instances each, where each class refers to a type of iris plant. One class is linearly separable from the other 2; the latter are NOT linearly separable from each other. The attributes are the sepal_length (in cm), the sepal_width (in cm), the petal_length (in cm) and the petal_width (in cm).
Start python (or ptpython) in a terminal. The following commands help you to load the Iris dataset.
from sklearn.datasets import load_iris  # the data are loaded
irisData = load_iris()

To print the data attributes, you can type:
print(irisData.data)
print(irisData.target)
print(irisData.target_names)
print(irisData.feature_names)
print(irisData.DESCR)

Execute the following commands. You can, of course, copy and paste them into your terminal (take some notes for yourself on what they do; one command returns an error, which is a classic Python mistake that you need to be aware of):
matplotlib, and in particular its module matplotlib.pyplot, can be used to visualize your data. "Pyplot provides the state-machine interface to the underlying plotting library in matplotlib. This means that figures and axes are implicitly and automatically created to achieve the desired plot. For example, calling plot from pyplot will automatically create the necessary figure and axes to achieve the desired plot. Setting a title will then automatically set that title to the current axes object".
from matplotlib import pyplot as plt  # replace the name "pyplot" by "plt"
X = irisData.data
Y = irisData.target
x = 0
y = 1
Save the following program in a file named PS1prog1.py:

# -*- coding: utf-8 -*-
from matplotlib import pyplot as plt
from sklearn.datasets import load_iris

irisData = load_iris()
X = irisData.data
Y = irisData.target
x = 0
y = 1
colors = ["red", "green", "blue"]
for i in range(3):
    plt.scatter(X[Y == i][:, x], X[Y == i][:, y], color=colors[i], label=irisData.target_names[i])
plt.legend()
plt.xlabel(irisData.feature_names[x])
plt.ylabel(irisData.feature_names[y])
plt.title("Iris Data - size of the sepals only")
plt.show()

In a terminal, launch the program by typing
python PS1prog1.py
Scikit-learn models rely on three main methods: fit(), predict() and score(). They are used by all machine learning models implemented in the library. score() gives an evaluation measure; which one is it?
The K-nearest-neighbors (K-NN) algorithm is implemented in the neighbors package.
In the following, clf = neighbors.KNeighborsClassifier(n_neighbors) creates a "K-NN classifier" object, clf.fit(X, Y) uses the data to define the classifier, the command clf.predict can be used to classify new examples, whereas clf.predict_proba estimates the probabilities of the predicted classes. clf.score computes the global score of the classifier on a given dataset.
from sklearn import neighbors

nb_neighb = 15
clf = neighbors.KNeighborsClassifier(nb_neighb)
# to know more about the parameters, type help(neighbors.KNeighborsClassifier)
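As a minimal sketch (not part of the original handout), the four methods described above can be chained on the Iris data loaded earlier (X and Y):

clf.fit(X, Y)                     # build the classifier from the whole Iris dataset
print(clf.predict(X[:5]))         # predicted classes for the first five examples
print(clf.predict_proba(X[:5]))   # estimated class probabilities for the same examples
print(clf.score(X, Y))            # global score of the classifier on the training data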
To properly evaluate a classifier, you should split the data into a training set and a test set (using model_selection.train_test_split). The metrics module provides a good number of evaluation criteria for your classifier.
from sklearn.model_selection import train_test_split
import random  # to generate random numbers

Execute the commands and understand (and write some comments about) what they do:
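As an illustrative sketch only (the exact commands of the session may differ), train_test_split and the metrics module are typically used as follows:

from sklearn.model_selection import train_test_split
from sklearn import metrics, neighbors
import random

# keep 30% of the examples aside for testing; the random seed is drawn at random
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.3, random_state=random.randint(0, 1000))
clf = neighbors.KNeighborsClassifier(15)
clf.fit(X_train, Y_train)
Y_pred = clf.predict(X_test)
print(metrics.accuracy_score(Y_test, Y_pred))    # proportion of correctly classified test examples
print(metrics.confusion_matrix(Y_test, Y_pred))  # confusion matrix on the test set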
Another useful tool is the class KFold, which splits the original dataset X into n folds, that is, pairs of (training set, test set).
from sklearn.model_selection import KFold

kf = KFold(n_splits=3, shuffle=True)
for learn, test in kf.split(X):
    print("train: ", learn, " test: ", test)
Execute the previous commands again with shuffle=False. What happened?

# -*- coding: utf-8 -*-
import random
from sklearn.datasets import load_iris

irisData = load_iris()
X = irisData.data
Y = irisData.target
....
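As a minimal sketch (one possible way of using the elements above, assuming a K-NN classifier), KFold can be combined with fit() and score() to evaluate a classifier fold by fold:

from sklearn.model_selection import KFold
from sklearn import neighbors

kf = KFold(n_splits=3, shuffle=True)
clf = neighbors.KNeighborsClassifier(15)
scores = []
for learn, test in kf.split(X):
    clf.fit(X[learn], Y[learn])                  # train on the current training fold
    scores.append(clf.score(X[test], Y[test]))   # evaluate on the corresponding test fold
print(scores)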
Decision trees are implemented in the tree package.
Execute the commands and understand what they do:
from sklearn.datasets import load_iris
iris = load_iris()
from sklearn import tree
If you type help(tree.DecisionTreeClassifier), you will see all the parameters of this algorithm, among which:
max_depth: (default=None) The maximum depth of the tree. If None, then nodes are expanded until all leaves are pure or until all leaves contain less than min_samples_split samples.
min_samples_split: (default=2) The minimum number of samples required to split an internal node.
max_leaf_nodes: (default=None) Grow a tree with max_leaf_nodes in best-first fashion. Best nodes are defined as relative reduction in impurity. If None, then the algorithm allows an unlimited number of leaf nodes.

For example:
clf = tree.DecisionTreeClassifier(criterion="entropy", max_depth=3, max_leaf_nodes=5)
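As a minimal sketch (the file name is only an example), the classifier defined above can be fitted and then exported to a .dot file with tree.export_graphviz:

clf = clf.fit(iris.data, iris.target)
tree.export_graphviz(clf, out_file="tree.dot",
                     feature_names=iris.feature_names,
                     class_names=iris.target_names,
                     filled=True)   # writes the tree structure in Graphviz (.dot) format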
Note that, in Linux, you can transform the .dot file into a .pdf using the command dot -Tpdf tree.dot -o iris.pdf. In Windows, you can start the graphviz application and open the .dot file from there.
You can include such a command in a Python program by using os.system(f'dot -Tpdf tree_tpsk1.dot -o tree_tpsk1_nleave.pdf') (remember to import the os module first).
Write a program (TPsk1.py) which opens the IRIS dataset and trains a decision tree using the default parameters. Visualize this tree. How many leaves does it contain? Start the training phase again by gradually decreasing the number of leaves from 9 to 3 using the command clf=tree.DecisionTreeClassifier(max_leaf_nodes=xx) and observe (and comment in your notebook) the resulting trees.
Write a program (TPsk2.py) which opens the IRIS dataset and trains a decision tree using the Gini criterion (in gini-iris.dot) and a second one using the entropy (in entropy-iris.dot). Compare the two trees.
Write a program (TPsk3.py) which creates a (new!) synthetic dataset using the command X,Y=make_classification(n_samples=100000,n_features=20,n_informative=15,n_classes=3) (see HELP, you need to import the function from sklearn.datasets), and split the generated data into a learning set and a test set (30% for test). Learn a decision tree on the learning set using the command clf=tree.DecisionTreeClassifier(max_leaf_nodes=500*i) (where "i" varies from 1 to 20), then print the score of the classifier (note that print("%6.4f" %x) prints a floating point number "x", at least six characters wide, with four characters after the decimal point) on the learning AND on the test set. What do you notice? What is the name of the observed phenomenon?
Do the same (in TPsk4.py) but with trees of different depths, using the command clf=tree.DecisionTreeClassifier(max_depth=i) (for i from 1 to 40).

The following commands load the DIGITS dataset; execute them and understand what they do:

from sklearn.datasets import load_digits
digits = load_digits()
digits.data[0]
digits.images[0]
digits.data[0].reshape(8,8)
digits.target[0]
# you can see the pictures using this piece of code:
from matplotlib import pyplot as plt
plt.gray()
plt.matshow(digits.images[0])
plt.show()
# to count the number of examples of a particular class, you can use:
Y = digits.target
print(len(Y[Y==0]))

Although it is much less efficient than TensorFlow/Keras for deep neural networks, sklearn also includes a class MLPClassifier, which implements a multi-layer perceptron (MLP) algorithm trained using backpropagation.
from sklearn.neural_network import MLPClassifier
X = digits.data

Execute the following commands and understand what they do:
clf = MLPClassifier(solver='lbfgs', alpha=1e-5, hidden_layer_sizes=(5, 2), random_state=1)
clf.fit(X, Y)
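Once fitted, the MLP exposes the same methods as the other Scikit-learn classifiers; for instance (purely illustrative):

print(clf.predict(X[:10]))   # predicted digits for the first ten images
print(clf.score(X, Y))       # score of the MLP on the data it was trained on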
Write a program (TPsk5.py) which opens the DIGITS dataset and trains an MLP using the default parameters and a small number of hidden layers. Split the data into a learning set and a test set (30% for test) and report the score obtained by your classifier on the test set.
Note that there is an existing command for this, which you can find here. The aim of the previous question is to implement this yourself (so please do not use the command).
The MNIST dataset is the big-brother version of the DIGITS dataset used in the previous exercise (70,000 samples of 28x28 pixels for MNIST).
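Before building the models below, MNIST has to be loaded and preprocessed. Here is a minimal sketch, assuming the standard keras.datasets.mnist loader (the exact preprocessing used in the tutorial may differ):

from keras.datasets import mnist
from keras.utils import np_utils

# load the 70000 MNIST images (60000 for training, 10000 for testing)
(X_train, y_train), (X_test, y_test) = mnist.load_data()
# flatten each 28x28 image into a vector of 784 pixels and rescale the grey levels to [0, 1]
num_pixels = X_train.shape[1] * X_train.shape[2]
X_train = X_train.reshape(X_train.shape[0], num_pixels).astype('float32') / 255
X_test = X_test.reshape(X_test.shape[0], num_pixels).astype('float32') / 255
# one-hot encode the class labels (digits 0 to 9)
y_train = np_utils.to_categorical(y_train)
y_test = np_utils.to_categorical(y_test)
num_classes = y_test.shape[1]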
This tutorial will show you how to:

1) Create a simple baseline neural network for MNIST with a single hidden dense layer.

from keras.models import Sequential
from keras.layers import Dense

def baseline_model():
    # create model
    model = Sequential()
    model.add(Dense(num_pixels, input_dim=num_pixels, kernel_initializer='normal', activation='relu'))
    model.add(Dense(num_classes, kernel_initializer='normal', activation='softmax'))
    # Compile model
    model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
    return model
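The function above only defines the architecture. As an illustration (the number of epochs and the batch size are indicative, and the variables come from the preprocessing sketch above), the model can then be built, trained and evaluated as follows:

model = baseline_model()
# train for 10 epochs while monitoring the accuracy on the test set
model.fit(X_train, y_train, validation_data=(X_test, y_test), epochs=10, batch_size=200, verbose=2)
# final evaluation of the model
scores = model.evaluate(X_test, y_test, verbose=0)
print("Test accuracy: %.4f" % scores[1])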
As an exercise, try to create with Keras exactly the same network as the one you created with Scikit-learn in the previous exercise. Compare the results obtained with both networks on the same data (DIGITS), and on the two different datasets (DIGITS for the first model with Scikit-learn, MNIST for the second with Keras), to evaluate how important the amount of data is for this task.
2) Create a simple CNN for MNIST including convolutional layers, pooling layers and dropout layers.

from keras.layers import Conv2D, MaxPooling2D, Dropout, Flatten

def baseline_model():
    # create model
    model = Sequential()
    model.add(Conv2D(32, (5, 5), input_shape=(1, 28, 28), activation='relu'))
    model.add(MaxPooling2D(pool_size=(2, 2)))
    model.add(Dropout(0.2))
    model.add(Flatten())
    model.add(Dense(128, activation='relu'))
    model.add(Dense(num_classes, activation='softmax'))
    # Compile model
    model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
    return model

3) Create a larger CNN architecture with additional convolutional, max pooling and fully connected layers. The network topology can be summarized as follows.
# define the larger model
def larger_model():
    # create model
    model = Sequential()
    model.add(Conv2D(30, (5, 5), input_shape=(1, 28, 28), activation='relu'))
    model.add(MaxPooling2D(pool_size=(2, 2)))
    model.add(Conv2D(15, (3, 3), activation='relu'))
    model.add(MaxPooling2D(pool_size=(2, 2)))
    model.add(Dropout(0.2))
    model.add(Flatten())
    model.add(Dense(128, activation='relu'))
    model.add(Dense(50, activation='relu'))
    model.add(Dense(num_classes, activation='softmax'))
    # Compile model
    model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
    return model
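Note that these CNNs expect image-shaped inputs rather than flat vectors of 784 pixels, and input_shape=(1, 28, 28) implies the 'channels_first' image format. Here is a minimal sketch of the corresponding preprocessing and training (parameter values are indicative):

from keras import backend as K
from keras.datasets import mnist
from keras.utils import np_utils

K.set_image_data_format('channels_first')   # so that (1, 28, 28) is read as (channels, rows, cols)
(X_train, y_train), (X_test, y_test) = mnist.load_data()
# reshape to (samples, channels, rows, cols) and rescale the grey levels to [0, 1]
X_train = X_train.reshape(X_train.shape[0], 1, 28, 28).astype('float32') / 255
X_test = X_test.reshape(X_test.shape[0], 1, 28, 28).astype('float32') / 255
y_train = np_utils.to_categorical(y_train)
y_test = np_utils.to_categorical(y_test)
num_classes = y_test.shape[1]

model = larger_model()
model.fit(X_train, y_train, validation_data=(X_test, y_test), epochs=10, batch_size=200, verbose=2)
scores = model.evaluate(X_test, y_test, verbose=0)
print("Test accuracy: %.4f" % scores[1])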