Practical Sessions
Machine Learning with Scikit-learn and Keras (~ 5h)

This practical session uses Scikit-learn and, for the last exercise on "deep learning", Keras (with the TensorFlow backend). The first 4 exercises are inspired by practical sessions given in French at Aix-Marseille University. Scikit-learn and Keras are free software machine learning libraries for the Python programming language. They are designed to interoperate with the Python numerical and scientific libraries NumPy and SciPy. A nice textbook on Python for Data Science is available on Github.

To install Scikit-learn and Keras (+TensorFlow) on your own computer, I suggest you start by installing Anaconda, which includes NumPy, SciPy and Scikit-learn. To visualize some models (in particular decision trees) in sklearn you will also need to install Graphviz, which can also be done using Anaconda with this instruction. Keras is also easy to install with Anaconda with this instruction. To install TensorFlow, follow this link.

You do not need a strong background in Python to do these practical sessions. However, if you want to learn the basics, you can follow this tutorial or this one.
Note that, to avoid using the standard Python interpreter from your terminal, you can use one of the many IDEs for Data Science. I would recommend the "Jupyter Notebook". It is a web application that allows you to create and share documents that contain live code, equations, visualizations and explanatory text. Jupyter is already installed with Anaconda: just type jupyter notebook in your dev directory to use it.

The original format and the description of most of the datasets that are used during this lab session can be found on the UCI web page. We will use, in particular, the ones provided in the Scikit-learn dataset repository: iris, boston, diabetes, digits, linnerud, sample images, 20newsgroups. The datasets share a common set of attributes (not all of them are always defined): data, target, target_names, feature_names, DESCR.

.data is an n*m dimensional array where n is the number of instances and m the number of attributes;
.target stores the class label of each instance (in a supervised setting);
.target_names stores the names of the classes;
.feature_names stores the names of the attributes;
.DESCR is a complete description of the dataset in text format.
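As a quick illustration of the attributes listed above, here is a minimal sketch that inspects them on the iris dataset. The variable name iris is only chosen for this example, and the exact set of keys may vary with your Scikit-learn version.

```python
from sklearn.datasets import load_iris

iris = load_iris()

# data is an (n_instances, n_attributes) NumPy array
n_instances, n_attributes = iris.data.shape
print(n_instances, n_attributes)

# target holds one integer class label per instance
print(iris.target.shape)

# target_names and feature_names give human-readable names
print(list(iris.target_names))
print(iris.feature_names)

# keys() lists which attributes this particular dataset defines
print(sorted(iris.keys()))
```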

On my computer, the datasets (in .csv) are available in "~/anaconda3/pkgs/scikit-learn-0.19.0-py36h4cafacf_2/lib/python3.6/site-packages/sklearn/datasets/" (it would be in an equivalent location on your computer). If you are curious about how the dataset attributes (data, target, target_names, feature_names, DESCR) mentioned before are loaded, you can look at the file "base.py" in this directory.

Exercise 1: Python warm up with IRIS (Fisher, 1936) (~30 minutes)

IRIS is perhaps the best known database to be found in the pattern recognition literature. Fisher's paper is a classic in the field and is referenced frequently to this day. The data set contains 3 classes (Iris-Setosa, Iris-Versicolour, Iris-Virginica) of 50 instances each, where each class refers to a type of iris plant. One class is linearly separable from the other 2; the latter are NOT linearly separable from each other. The attributes are the sepal_length (in cm), the sepal_width (in cm), the petal_length (in cm) and the petal_width (in cm).

Start python (or ptpython) in a terminal. The following commands help you to load the Iris dataset.

from sklearn.datasets import load_iris #data are loaded 
irisData=load_iris()

To print the data attributes, you can type:

print(irisData.data)
print(irisData.target)
print(irisData.target_names)
print(irisData.feature_names)
print(irisData.DESCR)

Execute the following commands. You can, of course, copy and paste them into your terminal. Take some notes for yourself on what they do. One command raises an error: this is a classic Python mistake that you need to be aware of.

print(len(irisData.data))
help(len) # to quit the "help" press 'q'
print(irisData.target_names[0])
print(irisData.target_names[2])
print(irisData.target_names[-1])
print(irisData.target_names[len(irisData.target_names)])
print(irisData.data.shape)
print(irisData.data[0])
print(irisData.data[0,1]) # same as irisData.data[0][1] but more used in Python
print(irisData.data[1:3,1])
print(irisData.data[:,1])
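If the indexing notation is new to you, the following sketch applies the same kinds of index and slice expressions to a tiny array whose values are made up for this example (A is not part of the lab data).

```python
import numpy as np

# Toy 3x4 array standing in for irisData.data (values are made up)
A = np.array([[1, 2, 3, 4],
              [5, 6, 7, 8],
              [9, 10, 11, 12]])

row0 = A[0]          # first row
cell = A[0, 1]       # row 0, column 1
part = A[1:3, 1]     # column 1 of rows 1 and 2 (index 3 is excluded)
col1 = A[:, 1]       # column 1 of every row
last = A[-1]         # negative indices count from the end

print(row0, cell, part, col1, last)
```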

The library matplotlib and in particular its module matplotlib.pyplot can be used to visualize your data. "Pyplot provides the state-machine interface to the underlying plotting library in matplotlib. This means that figures and axes are implicitly and automatically created to achieve the desired plot. For example, calling plot from pyplot will automatically create the necessary figure and axes to achieve the desired plot. Setting a title will then automatically set that title to the current axes object".
Execute the following commands and understand (and write in your report) what they do:

from matplotlib import pyplot as plt # replace the name "pyplot" by "plt" 
X=irisData.data
Y=irisData.target
x=0
y=1

plt.scatter(X[:, x], X[:, y],c=Y) # all the functions defined in a given library should be prefixed by the name of the library
plt.show()
plt.xlabel(irisData.feature_names[x])
plt.ylabel(irisData.feature_names[y])
plt.scatter(X[:, x], X[:, y],c=Y)
plt.show()

The following commands draw the same plot with a legend that names each class (this is usually useful when you want to report on some analyses):

print(Y==0)
print(X[Y==0])
print(X[Y==0][:, x])
plt.scatter(X[Y==0][:, x],X[Y==0][:, y], color="red",label=irisData.target_names[0])
plt.scatter(X[Y==1][:, x],X[Y==1][:, y], color="green",label=irisData.target_names[1])
plt.scatter(X[Y==2][:, x],X[Y==2][:, y], color="blue",label=irisData.target_names[2])
plt.legend()
plt.show()
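The expressions Y==0 and X[Y==0] rely on NumPy boolean masks. Here is a minimal sketch on made-up data (X_toy and Y_toy are hypothetical names used only here) showing each step separately.

```python
import numpy as np

# Toy data: 5 instances, 2 attributes, with made-up class labels
X_toy = np.array([[5.1, 3.5], [7.0, 3.2], [4.9, 3.0], [6.3, 3.3], [6.4, 3.2]])
Y_toy = np.array([0, 1, 0, 2, 1])

mask = (Y_toy == 0)          # array of booleans, one per instance
rows = X_toy[mask]           # keeps only the rows where mask is True
first_attr = rows[:, 0]      # first attribute of the selected rows

print(mask)
print(rows)
print(first_attr)
```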

And finally, create a file "PS1prog1.py" which contains the following program (the aim here is to run the same commands from a Python file instead of typing them at the command line):

# -*- coding: utf-8 -*-
from matplotlib import pyplot as plt
from sklearn.datasets import load_iris
irisData=load_iris()
X=irisData.data
Y=irisData.target
x=0 
y=1
colors=["red","green","blue"]
for i in range(3):
	plt.scatter(X[Y==i][:, x],X[Y==i][:,y],color=colors[i],label=irisData.target_names[i])
plt.legend()
plt.xlabel(irisData.feature_names[x]) 
plt.ylabel(irisData.feature_names[y])
plt.title("Iris Data - size of the sepals only") 
plt.show()

In a terminal, launch the program by typing python PS1prog1.py.

Exercise 2: K-NN classifier on IRIS (~40 minutes)

We will now use the three most useful commands in scikit-learn: fit(), predict() and score(). They are used by all machine learning models implemented in the library. score() gives an evaluation measure: which one is it?
You can find more information about the K-NN classifier in scikit-learn on this page. The K-NN algorithm is implemented in the neighbors package. In the following, clf = neighbors.KNeighborsClassifier(n_neighbors) creates a "K-NN classifier" object, clf.fit(X, Y) uses the data to define the classifier, and clf.predict can be used to classify new examples, whereas clf.predict_proba estimates the probability of each class for a given example. clf.score computes the global score of the classifier on a given dataset.
Execute the following commands and understand what they do (I advise you to add this as comments in your Python notebook):

from sklearn import neighbors
nb_neighb = 15
clf = neighbors.KNeighborsClassifier(nb_neighb) # to know more about the parameters, type help(neighbors.KNeighborsClassifier)

clf.fit(X, Y) #this obviously does not work if X and Y were not defined before in previous commands, to know more about the parameters help(clf.fit)
print(clf.predict([[ 5.4, 3.2, 1.6, 0.4]]))
print(clf.predict_proba([[ 5.4, 3.2, 1.6, 0.4]]))
print(clf.score(X,Y))
Z = clf.predict(X)
print(X[Z!=Y])

You have learned that the empirical score (e.g. accuracy) measured on the training set overestimates the true score of your classifier on unseen data. We thus need a set of data that is independent from the training data but generated under the same conditions to evaluate the classifier. We therefore separate our data into 2 sets, the training set and the test set, train the classifier on the training set and evaluate it on the test set. However, if you have few data (as for Iris), this evaluation might be pessimistic (do you know why?). Scikit-learn provides a model_selection package which allows you to split a data set into training and test sets (using model_selection.train_test_split). The metrics module provides a good number of evaluation criteria for your classifier.

from sklearn.model_selection import train_test_split 
import random # to generate random numbers

Execute the commands and understand (and write some comments about) what they do:

X_train,X_test,Y_train,Y_test=train_test_split(X,Y,test_size=0.3,random_state=random.seed()) # random.seed() returns None, so the split changes at every run
len(X_train)
len(X_test)
len(X_train[Y_train==0])
len(X_train[Y_train==1])
len(X_train[Y_train==2])
clf=clf.fit(X_train, Y_train)
Y_pred =clf.predict(X_test)
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(Y_test, Y_pred)
print(cm)

Explain in a few lines what you can see (in general) in a confusion matrix.
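To make the layout of the matrix concrete, here is a small sketch that builds one by hand on made-up labels. toy_confusion_matrix is a hypothetical helper written for this example, not a Scikit-learn function. It follows the convention of sklearn.metrics.confusion_matrix, where rows are true labels and columns are predicted labels.

```python
def toy_confusion_matrix(y_true, y_pred, n_classes):
    """Count how often true class i was predicted as class j."""
    cm = [[0] * n_classes for _ in range(n_classes)]
    for t, p in zip(y_true, y_pred):
        cm[t][p] += 1   # row = true label, column = predicted label
    return cm

# Made-up labels for 6 instances and 3 classes
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 0, 1, 2, 2, 1]

cm = toy_confusion_matrix(y_true, y_pred, 3)
for row in cm:
    print(row)

# Correct predictions sit on the diagonal
accuracy = sum(cm[i][i] for i in range(3)) / len(y_true)
print(accuracy)
```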

Why did we choose k=15? How do we choose the best value for k? In general, such hyperparameters are tuned using cross validation. The model_selection package offers the class KFold, which splits the original dataset X into n folds, each giving a pair (training set, test set).
Execute the following commands:

from sklearn.model_selection import KFold
kf=KFold(n_splits=3,shuffle=True)
for learn,test in kf.split(X):
	print("train : ",learn," test ",test)

Try again using the parameter shuffle=False. What happened?
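One property of K-fold splitting worth checking for yourself is that every instance is used for testing exactly once. The sketch below verifies it on 6 made-up instances. X_toy and the fixed random_state=0 are assumptions made only so that the example is reproducible.

```python
import numpy as np
from sklearn.model_selection import KFold

X_toy = np.arange(12).reshape(6, 2)   # 6 made-up instances
kf = KFold(n_splits=3, shuffle=True, random_state=0)

seen_in_test = []
for train_idx, test_idx in kf.split(X_toy):
    # the two index arrays never overlap within a fold
    assert set(train_idx).isdisjoint(test_idx)
    seen_in_test.extend(test_idx.tolist())
    print("train:", train_idx, "test:", test_idx)

# across the 3 folds, each instance is used for testing exactly once
print(sorted(seen_in_test))
```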

Write a small program (using the skeleton given below) to automatically find the best hyperparameter K for a K-NN classifier.

# -*- coding: utf-8 -*-
import random
from sklearn.datasets import load_iris
irisData=load_iris()
X=irisData.data
Y=irisData.target
....

Exercise 3: Decision Trees on IRIS (~50 minutes)

You can find more information about Decision Tree classifiers in scikit-learn on this page. The algorithm is implemented in the tree package. Execute the commands and understand what they do:

from sklearn.datasets import load_iris
iris=load_iris()
from sklearn import tree

clf=tree.DecisionTreeClassifier()
clf=clf.fit(iris.data,iris.target)
print(clf.predict([iris.data[50,:]]))
print(clf.score(iris.data,iris.target))
tree.export_graphviz(clf,out_file='tree.dot')

If you try the command help(tree.DecisionTreeClassifier) you will see all the parameters of this algorithm among which:

max_depth: (default=None) The maximum depth of the tree. If None, then nodes are expanded until all leaves are pure or until all leaves contain less than min_samples_split samples.
min_samples_split:(default=2) The minimum number of samples required to split an internal node.
max_leaf_nodes: grow a tree with ``max_leaf_nodes`` in best-first fashion. Best nodes are defined as relative reduction in impurity. If "None" then the algorithm allows an unlimited number of leaf nodes.

For example, you can try the command clf=tree.DecisionTreeClassifier(criterion="entropy",max_depth=3,max_leaf_nodes=5). Note that, in Linux, you can transform the .dot file into a .pdf using the command dot -Tpdf tree.dot -o iris.pdf. In Windows, you can start the Graphviz application and open the .dot file from there. You can run the same conversion from a Python program (after import os) by using os.system(f'dot -Tpdf tree_tpsk1.dot -o tree_tpsk1_nleave.pdf').
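If Graphviz is not available on your machine, a text rendering of the tree can be a convenient fallback. The sketch below assumes Scikit-learn 0.21 or later, where tree.export_text was introduced (it is not available in the 0.19 version mentioned earlier). The max_depth=2 and random_state=0 settings are arbitrary choices for this example.

```python
from sklearn.datasets import load_iris
from sklearn import tree

iris = load_iris()
clf = tree.DecisionTreeClassifier(max_depth=2, random_state=0)
clf.fit(iris.data, iris.target)

# Plain-text rendering of the fitted tree, one line per node
report = tree.export_text(clf, feature_names=list(iris.feature_names))
print(report)
```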

You can now write a program (ex: TPsk1.py) which opens the IRIS dataset and trains a decision tree using the default parameters. Visualize this tree. How many leaves does it contain? Start the training phase again by gradually decreasing the number of leaves from 9 to 3 using the command clf=tree.DecisionTreeClassifier(max_leaf_nodes=xx) and observe (and comment in your notebook) the resulting trees.

Write a program (ex: TPsk2.py) which opens the IRIS dataset and trains a decision tree using the Gini criterion (in gini-iris.dot) and a second one using the entropy (in entropy-iris.dot). Compare the two trees.

Write a program (ex: TPsk3.py) which creates a (new !) synthetic dataset using the command X,Y=make_classification(n_samples=100000,n_features=20,n_informative=15,n_classes=3) (see HELP, you need to import the function from sklearn.datasets), split the generated data into a learning set and a test set (30% for test). Learn a decision tree on the learning set using the command clf=tree.DecisionTreeClassifier(max_leaf_nodes=500*i) (where "i" varies from 1 to 20) then print the score of the classifier (note that print("%6.4f" %x) prints a floating point number "x", at least six characters wide, with four characters after the decimal point) on the learning AND on the test set. What do you notice? What is the name of the observed phenomenon?
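The print format mentioned above can be tried on its own. In this short sketch the value of x is made up, and the last two lines show equivalent, more recent formatting styles.

```python
x = 0.912345678

print("%6.4f" % x)          # four digits after the decimal point
print("%10.4f" % x)         # padded on the left to ten characters
print("{:6.4f}".format(x))  # same result as the first line, with str.format
print(f"{x:6.4f}")          # same result as the first line, with an f-string
```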

Same question (ex: TPsk4.py) but with trees of different depths using the command clf=tree.DecisionTreeClassifier(max_depth=i) (for i from 1 to 40).

You can deduce from the last two questions that the depth of the tree and the maximum number of leaves are important parameters for the model. They are linked but different: if you control one you also control the other, but it is difficult to know in advance which one you should work with. You must tune these parameters using cross validation to find the ones that are best suited to your problem.

Exercise 4: Neural Networks on DIGITS (~40 min)

The original digits dataset contains 5620 instances that are described by 64 attributes (corresponding to 8*8 images of integer pixels in the range 0 (white) to 16 (black)). The copy returned by load_digits in Scikit-learn contains 1797 of these instances. The target is a digit between 0 and 9.
Execute the following program:

      
from sklearn.datasets import load_digits
digits=load_digits()
digits.data[0]
digits.images[0]
digits.data[0].reshape(8,8)
digits.target[0]
#you can see the pictures using this piece of code: 
from matplotlib import pyplot as plt
plt.gray()
plt.matshow(digits.images[0])
plt.show()
#to count the number of examples of a particular class, you can use:
Y=digits.target
print(len(Y[Y==0]))

Although it is much less efficient than TensorFlow/Keras for deep neural networks, sklearn also includes a class MLPClassifier which implements a multi-layer perceptron (MLP) trained using backpropagation.

    
from sklearn.neural_network import MLPClassifier
X = digits.data

Execute the following commands and understand what they do:

clf = MLPClassifier(solver='lbfgs', alpha=1e-5, hidden_layer_sizes=(5, 2),random_state=1)
clf.fit(X, Y)

Write a program (ex: TPsk5.py) which opens the DIGITS dataset and trains an MLP using the default parameters and a small number of hidden layers. Split the data into a learning set and a test set (30% for test) and report the score obtained by your classifier on the test set.
Neural network architectures are usually chosen from a restricted list of possible hyperparameters (testing everything is computationally too costly). The candidates are usually enumerated by grid search and the best one is selected by cross validation. Find the best hyperparameters in terms of number of neurons, number of layers and learning rate by testing 3 possible relevant values for each of them (decide what is relevant). Create a program that can automatically choose these by cross-validation and report the best combination. In scikit-learn, this can be done with a command that you can find here. The aim of the previous question is to implement this yourself (so please do not use the command).
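As a starting point, the grid itself can be enumerated with itertools.product. The sketch below only builds the list of candidate settings and leaves the cross-validation loop to you. The candidate values are hypothetical and chosen purely for illustration. hidden_layer_sizes and learning_rate_init are MLPClassifier parameter names, and learning_rate_init is only used by the 'sgd' and 'adam' solvers.

```python
from itertools import product

# Hypothetical candidate values, chosen only for illustration
n_neurons = [16, 64, 128]
n_layers = [1, 2, 3]
learning_rates = [0.0001, 0.001, 0.01]

grid = []
for neurons, layers, lr in product(n_neurons, n_layers, learning_rates):
    # MLPClassifier expects one entry per hidden layer
    hidden_layer_sizes = (neurons,) * layers
    grid.append({"hidden_layer_sizes": hidden_layer_sizes,
                 "learning_rate_init": lr})

print(len(grid))   # 3 * 3 * 3 combinations
print(grid[0])
```

Each dictionary can then be passed to the classifier as MLPClassifier(**params) inside your own cross-validation loop.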

Exercise 5: Neural Networks on MNIST with Keras and TensorFlow (~2 hours)

Keras is a high-level API to build and train deep learning models. It's used for fast prototyping, advanced research, and production. It can run on top of TensorFlow or other deep learning frameworks.
There are a number of very good tutorials available about using Keras on digit datasets. Today we will follow THIS ONE (from Jason Brownlee). The core data structure of Keras is a model, a way to organize layers. The simplest type of model is the Sequential model, a linear stack of layers (mainly for feed-forward models).

The MNIST dataset is the big brother of the Digits dataset used in the previous exercise (70000 samples of size 28x28 pixels for MNIST).

This tutorial will show you how to:
1) Create and test a simple model with one hidden layer with the same number of neurons as there are inputs (784). A rectifier activation function is used for the neurons in the hidden layer.

def baseline_model():
	# create model
	model = Sequential()
	model.add(Dense(num_pixels, input_dim=num_pixels, kernel_initializer='normal', activation='relu'))
	model.add(Dense(num_classes, kernel_initializer='normal', activation='softmax'))
	# Compile model
	model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
	return model
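The model above takes flat vectors of num_pixels values as input and, because of the categorical_crossentropy loss, expects one-hot encoded targets. The sketch below shows one way to prepare such inputs with NumPy. It uses random stand-in arrays rather than the real MNIST data, and the variable names other than num_pixels and num_classes are assumptions made for this example. The tutorial may perform these steps differently.

```python
import numpy as np

# Random stand-in for a batch of MNIST images and labels (not real data)
rng = np.random.default_rng(0)
images = rng.integers(0, 256, size=(5, 28, 28), dtype=np.uint8)
labels = np.array([3, 0, 9, 1, 3])

num_pixels = 28 * 28
num_classes = 10

# Flatten each 28x28 image into a vector of 784 values scaled to [0, 1]
X_flat = images.reshape(len(images), num_pixels).astype("float32") / 255

# One-hot encode the labels: row i has a single 1 in column labels[i]
Y_onehot = np.eye(num_classes, dtype="float32")[labels]

print(X_flat.shape, Y_onehot.shape)
```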

As an exercise, try to create with Keras exactly the same network as the one you created with Scikit-learn in the previous exercise. Compare the results of both networks on the same data (digits), and then on the two different datasets (digits for the first model with Scikit-learn, MNIST for the second with Keras), to evaluate how important the amount of data is for this task.

2) Create a simple CNN for MNIST including Convolutional layers, Pooling layers and Dropout layers.

def baseline_model():
	# create model
	model = Sequential()
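	# input_shape=(1, 28, 28) assumes the channels-first image layout; with the channels-last layout use (28, 28, 1)
	model.add(Conv2D(32, (5, 5), input_shape=(1, 28, 28), activation='relu'))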
	model.add(MaxPooling2D(pool_size=(2, 2)))
	model.add(Dropout(0.2))
	model.add(Flatten())
	model.add(Dense(128, activation='relu'))
	model.add(Dense(num_classes, activation='softmax'))
	# Compile model
	model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
	return model

3) Create a larger CNN architecture with additional convolutional, max pooling layers and fully connected layers. The network topology can be summarized as follows.

Convolutional layer with 30 feature maps of size 5x5.
Pooling layer taking the max over 2*2 patches.
Convolutional layer with 15 feature maps of size 3x3.
Pooling layer taking the max over 2*2 patches.
Dropout layer with a probability of 20%.
Flatten layer.
Fully connected layer with 128 neurons and rectifier activation.
Fully connected layer with 50 neurons and rectifier activation.
Output layer.
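To check your understanding of this topology, you can trace the size of the feature maps through the layers by hand. The sketch below does this arithmetic assuming the Keras defaults for these layers (stride 1 and no padding for convolutions, non-overlapping pooling windows). conv_out and pool_out are hypothetical helpers written for this example.

```python
def conv_out(size, kernel):
    """Output width of a stride-1 convolution without padding."""
    return size - kernel + 1

def pool_out(size, pool):
    """Output width of non-overlapping max pooling."""
    return size // pool

size = 28                     # MNIST images are 28x28
size = conv_out(size, 5)      # 30 feature maps of 24x24
size = pool_out(size, 2)      # 12x12
size = conv_out(size, 3)      # 15 feature maps of 10x10
size = pool_out(size, 2)      # 5x5
flattened = 15 * size * size  # dropout does not change the shape

print(size, flattened)
```

Under these assumptions, the Flatten layer feeds 375 values into the first fully connected layer.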