Practical Session: introduction to WEKA (3-4h)

The lab session uses WEKA, a machine learning tool implemented in Java. WEKA implements many of the classification, regression and clustering algorithms that you have seen during the Machine Learning ("Apprentissage Automatique") and Data Mining lectures. WEKA can be freely downloaded from here (for Linux, Mac and Windows).

For each exercise, write down a small report for yourself (4 or 5 lines) that records your results and their analysis.

The original format and the description of most of the datasets used during these lab sessions can be found on the UCI web page.

You can use Wikipedia as a source of information to remember some learning algorithms or some concepts such as cross-validation, inductive bias, etc.

Exercise 1: Manipulation of WEKA data

During this exercise you will learn to use part of WEKA's graphical interface (GUI), the Explorer (click on the "Explorer" button to start it). You can also run algorithms on your data directly from the command line, or use the GUI to build the list of parameters (for example for the J48 class) and then reuse those parameters on the command line. In the main WEKA interface, click on the "Simple CLI" button to start the command-line interface.
You can find information about the ARFF data format here.
For all other information, including the use of the Explorer or of the command-line interface, please refer to the Weka Manual.
  1. (Q) Choose the kind of preference you would like to explain, e.g. actors/actresses, food, cars, music bands, or anything you want.
  2. (Q) Create a data file in arff format containing about 20 entries, each described by about 4 attributes and the last attribute containing your preference, e.g.
    @relation food
    @attribute calories numeric
    @attribute taste {sweet, sour, bitter, salty}
    @attribute vegetarian {yes, no}
    @attribute like_it {yes, no}
    @data
    100, sweet, yes, yes      % icecream
    80, bitter, yes, yes      % beer
    2, sweet, yes, no         % tic-tac
    ...
    
  3. (Q) Using the Explorer (GUI), compare 3 algorithms for classifying your data: decision trees (e.g. weka.classifiers.trees.J48), a rule learner (e.g. weka.classifiers.rules.PART), and naive Bayes (weka.classifiers.bayes.NaiveBayes). To do this, open the "Classify" tab and then choose, for example, classifiers->trees->J48 (an implementation of the C4.5 decision tree algorithm). For each algorithm, check the percentage of "Correctly Classified Instances" under "Stratified cross-validation" (which algorithm explains your personality best?), and look at the generated rules (do they tell you anything interesting?).
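
If you prefer to generate the ARFF file of step 2 from a script rather than typing it by hand, the following Python sketch shows one way to do it. The helper name to_arff and the three sample rows are illustrative assumptions (the rows repeat the food example above); only the @relation, @attribute and @data keywords come from the ARFF format itself.

```python
# Minimal sketch: build the text of an ARFF file from Python data.
# The helper name `to_arff` is hypothetical, not part of WEKA.

def to_arff(relation, attributes, rows):
    """Return ARFF text.

    attributes: list of (name, spec) pairs, where spec is the string
                "numeric" or a list of nominal values.
    rows:       list of tuples, one value per attribute.
    """
    lines = [f"@relation {relation}"]
    for name, spec in attributes:
        if isinstance(spec, str):
            lines.append(f"@attribute {name} {spec}")
        else:
            # Nominal attribute: values listed between braces.
            lines.append(f"@attribute {name} {{{', '.join(spec)}}}")
    lines.append("@data")
    for row in rows:
        if len(row) != len(attributes):
            raise ValueError(f"expected {len(attributes)} values, got {row!r}")
        lines.append(", ".join(str(v) for v in row))
    return "\n".join(lines) + "\n"


attributes = [
    ("calories", "numeric"),
    ("taste", ["sweet", "sour", "bitter", "salty"]),
    ("vegetarian", ["yes", "no"]),
    ("like_it", ["yes", "no"]),  # last attribute = the preference to learn
]
rows = [
    (100, "sweet", "yes", "yes"),
    (80, "bitter", "yes", "yes"),
    (2, "sweet", "yes", "no"),
]
arff_text = to_arff("food", attributes, rows)
print(arff_text)
```

Writing arff_text to a file named, for instance, food.arff gives a dataset that can be opened in the Explorer.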

Exercise 2: Decision Tree Learning

We would like to train a computer to learn the concept of different animal types (mammals, fish, insects and so on). Our input is the file zoo.arff.
  1. Open file zoo.arff in WEKA.
  2. (Q) Remove what you think are unnecessary attributes by choosing them and pressing remove (how many attributes did you remove?).
  3. Save the dataset with the removed attributes (call it zoo-removed.arff).
  4. Select the J48 classifier in the Explorer's "Classify" tab to classify your data. We will use 66% of the animals to train the classifier and the remaining 34% to evaluate it. To do this, choose the "Percentage split" option and set it to 66%.
  5. (Q) Build a decision tree by clicking on the "Start" button. The output appears in the "Classifier output" panel on the right, and a new entry is added to the "Result list" on the left. To see the tree, right-click on that entry in the result list and choose "Visualize tree". In the tree window, right-click and adjust the size of the tree using the menu options. In the output panel, below the textual version of the tree, you will find the estimate of the tree's predictive performance.
  6. (Q) The output reports the percentage of correctly classified instances, as well as true positives, true negatives and false positives. What do these values refer to? What are they used for?
  7. (Q) Repeat the experiment with the "Use training set" option. The evaluation is then performed on the training set itself, so it is highly optimistic and is best read as an upper bound on the performance you can achieve with this model.
  8. (Q) Repeat the experiment with 10-fold cross-validation (the set is divided into 10 parts; 9 parts are used for training and 1 for testing; the process is repeated 10 times, so that each part is used once for testing, and the results are averaged) and with 5-fold cross-validation. From the results, draw a conclusion about the ability of the J48 classifier to extract a concept from a given dataset. What is the predictive performance of the model (on a scale from good to bad), and does it depend on the random selection of the training subset?
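
The cross-validation procedure described in step 8 can be sketched in a few lines of Python. This is only an illustration of the principle: the function names, the toy labels and the majority-class "classifier" are assumptions made for the example, and the split is a plain random one, whereas WEKA stratifies the folds so that each keeps roughly the class proportions of the full set.

```python
import random
from collections import Counter


def k_fold_indices(n, k, seed=1):
    """Shuffle the indices 0..n-1 and split them into k folds of (almost) equal size."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def cross_validate(labels, k, train_and_predict, seed=1):
    """Average the test accuracy over k folds.

    train_and_predict(train_idx, test_idx) must return one predicted
    label per index in test_idx, using only the training indices to learn.
    """
    folds = k_fold_indices(len(labels), k, seed)
    accuracies = []
    for i, test_idx in enumerate(folds):
        # Every fold except the i-th one is used for training.
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        predicted = train_and_predict(train_idx, test_idx)
        correct = sum(p == labels[j] for p, j in zip(predicted, test_idx))
        accuracies.append(correct / len(test_idx))
    return sum(accuracies) / k


# Toy data (hypothetical): 12 mammals and 8 birds, labels only.
labels = ["mammal"] * 12 + ["bird"] * 8


def majority_baseline(train_idx, test_idx):
    """Stand-in for a real learner: always predict the most frequent training label."""
    most_common = Counter(labels[j] for j in train_idx).most_common(1)[0][0]
    return [most_common] * len(test_idx)


folds = k_fold_indices(len(labels), 10)
accuracy = cross_validate(labels, 10, majority_baseline)
print(f"10-fold accuracy of the majority baseline: {accuracy:.2f}")
```

Changing the seed changes which instances end up in which fold, which is one way to observe how much an estimate depends on the random split.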

    Predicting new instances with the decision tree

    While the GUI version of WEKA is convenient for visualizing results and setting parameters through forms, the most direct and flexible way to build a classification (or prediction) model and then apply it to new instances is the command line. Note that command lines can be called from any home-made software (in particular if you need WEKA in your professional life). Let zoo_test.arff be our new test set. Notice (by looking at the dataset) that this time the value of the last attribute ("type") is unknown ("?"). Make sure that the chosen training set and your test set have the same attributes.

  9. The main command for generating the classification model as we did above is:
    java weka.classifiers.trees.J48 -C 0.25 -M 2 -t directory-path/zoo-removed.arff -d directory-path/zoo.model
    
    The options -C 0.25 and -M 2 in the above command are the same options that we selected for the J48 classifier in the previous GUI example. The -t option specifies that the next string is the full path to the training file (in this case "zoo-removed.arff"). In the above command, directory-path should be replaced with the full path of the directory where the training file resides. Finally, the -d option specifies the name (and location) of the file in which the model will be stored. After executing this command inside the "Simple CLI" interface, you should see the tree and statistics about the model in the top window.
  10. Based on the above command, our classification model has been stored in the file "zoo.model" and placed in the directory we specified. We can now apply this model to the new instances. The advantage of building a model and storing it is that it can be applied at any time to different sets of unclassified instances. The command for doing so is:
    java weka.classifiers.trees.J48 -p 17 -l directory-path/zoo.model -T directory-path/zoo_test.arff
    
    The option -p 17 tells WEKA to print one prediction per test instance instead of the usual evaluation statistics; the argument of -p lists attributes to print alongside each prediction, and the attribute being predicted is the class attribute, by default the last one ("type", attribute number 17 here). The -l option specifies the path and name of the model file (the one created in the previous step). Finally, the -T option specifies the name (and path) of the test data. In our example, the test data is our file of new instances, "zoo_test.arff". This command results in a 5-column output similar to the following:
    inst#    actual    predicted    error    prediction
    0        1:?       1:mammal              0.75
    1        1:?       2:bird       +        0.7272727272727273
    2        1:?       1:mammal              0.95
    3        1:?       3:reptile    +        0.8813559322033898
    
    
    The 1st column is the instance number assigned by WEKA to the new instances in "zoo_test.arff". The 2nd column is the actual "type" value in the test data (in this case, we did not have a value for "type" in "zoo_test.arff", so this value is "?" and it is labelled with the majority class, "1:"). The 3rd column is the predicted value of the "type" attribute for the new instance. The 4th column contains a "+" when the predicted value differs from the one in the 2nd column, which is not meaningful here since the actual type is unknown. Finally, the 5th column is the confidence of the prediction for that instance.
  11. (Q) What output do you get? What does this confidence mean?
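
Since the command line can be called from your own software, you will often need to read this output back into a program. The Python sketch below parses the 5-column listing shown above; it assumes exactly that layout (whitespace-separated columns, an optional "+" in the error column), and the function name parse_predictions is a hypothetical helper, not part of WEKA. Other WEKA versions or output options may print a different format.

```python
def parse_predictions(text):
    """Parse prediction lines such as '1  1:?  2:bird  +  0.727' into dictionaries."""
    records = []
    for line in text.splitlines():
        fields = line.split()
        # Skip blank lines, the header line and anything that is not a data row.
        if len(fields) < 4 or not fields[0].isdigit():
            continue
        records.append({
            "instance": int(fields[0]),
            "actual": fields[1].split(":", 1)[1],      # '1:?'      -> '?'
            "predicted": fields[2].split(":", 1)[1],   # '1:mammal' -> 'mammal'
            "error": "+" in fields[3:-1],              # error column may be empty
            "confidence": float(fields[-1]),
        })
    return records


# Sample output copied from the listing above.
sample_output = """\
inst#    actual    predicted    error    prediction
0        1:?       1:mammal              0.75
1        1:?       2:bird       +        0.7272727272727273
2        1:?       1:mammal              0.95
3        1:?       3:reptile    +        0.8813559322033898
"""

records = parse_predictions(sample_output)
for record in records:
    print(record["instance"], record["predicted"], record["confidence"])
```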

Exercise 3: Choice of learning algorithm

We will have to solve two classification problems, with the following training sets:

The datasets are already in WEKA's ARFF format. There are no testing sets, so to assess performance, please use 10-fold cross-validation.

You will compare performance of two sub-symbolic classifiers:

  1. Nearest Neighbour - to classify a new instance, it looks for the most similar instance in the training set (the one at the shortest Euclidean distance) and assigns the class of this closest instance. This classifier is implemented in Weka in weka.classifiers.lazy.IB1.
  2. Perceptron - this classifier simply finds a line (note: line, not curve) in the two-dimensional feature space which most accurately divides the instances of the training set into two classes. During classification, it simply checks on which side of the line a new instance lies. To build this "Perceptron" classifier in Weka, we will use a "degenerate" neural network with no hidden layers: weka.classifiers.functions.MultilayerPerceptron -H 0
  3. (Q) First, find the accuracy of the above classifiers on the above data sets (percentage of "Correctly Classified Instances" in "Stratified cross-validation"). Can you say that one of the above algorithms is in general better than the other?
  4. (Q) To understand and comment on the results, you can visualise the two data sets in the Weka Explorer.
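
To make the two classifiers described in items 1 and 2 concrete, here is a minimal Python sketch of both. It is not what WEKA runs internally: the nearest-neighbour part follows the description above directly, but the perceptron uses the classic perceptron update rule as a stand-in, whereas MultilayerPerceptron -H 0 is trained by backpropagation. The function names and the six toy points are assumptions made for the example.

```python
import math


def nearest_neighbour_predict(train_X, train_y, x):
    """Return the label of the training point closest to x (Euclidean distance)."""
    distances = [math.dist(point, x) for point in train_X]
    return train_y[distances.index(min(distances))]


def train_perceptron(X, y, epochs=100, learning_rate=1.0):
    """Classic perceptron rule for labels in {0, 1}; returns (weights, bias)."""
    weights = [0.0] * len(X[0])
    bias = 0.0
    for _ in range(epochs):
        mistakes = 0
        for xi, target in zip(X, y):
            activation = sum(w * v for w, v in zip(weights, xi)) + bias
            predicted = 1 if activation > 0 else 0
            if predicted != target:
                # Move the separating line towards the misclassified point.
                step = learning_rate * (target - predicted)
                weights = [w + step * v for w, v in zip(weights, xi)]
                bias += step
                mistakes += 1
        if mistakes == 0:  # every training point is on the correct side
            break
    return weights, bias


def perceptron_predict(weights, bias, x):
    """Check on which side of the line w.x + b = 0 the point x lies."""
    return 1 if sum(w * v for w, v in zip(weights, x)) + bias > 0 else 0


# Toy two-dimensional data (hypothetical), linearly separable.
X = [(1, 1), (1, 2), (2, 1), (4, 4), (4, 5), (5, 4)]
y = [0, 0, 0, 1, 1, 1]

weights, bias = train_perceptron(X, y)
for query in [(1.5, 1.5), (4.5, 4.5)]:
    print(query,
          "1-NN:", nearest_neighbour_predict(X, y, query),
          "perceptron:", perceptron_predict(weights, bias, query))
```

On data that a single line can separate, both methods agree; it is worth thinking about what happens to each of them when no such line exists.
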
  5. Using discretization

    The KNN and perceptron classifiers are designed to handle numerical attributes.
  6. (Q) Try to discretize the data beforehand in the "Preprocess" panel, using the filter weka.filters.supervised.attribute.Discretize (in this case, the discretization uses an entropy-based method) or the filter weka.filters.unsupervised.attribute.Discretize with different numbers of bins. For both filters, the parameter -R col1,col2-col4,... can be used to specify the list of columns to discretize: "first" and "last" are valid indexes (and the default ones for unsupervised discretization). A very interesting paper [Dougherty et al., 1995], which surveys the different discretization methods, can be found HERE.
    NB: supervised discretization returns a single bin (labelled "All") if it cannot find a meaningful split of the attribute with respect to the target class.
  7. (Q) Use also a "symbolic" classifier (for example J48) on your discretized data.
  8. (Q) What is the influence of the number of bins in unsupervised discretization? (Do not forget to reload the original (numeric) relation or to "undo" the discretization before applying another one.)
  9. (Q) Does a symbolic classifier perform better than the numeric ones after discretization? What can you conclude?
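
To get a feeling for what the number of bins does, the following Python sketch performs equal-width binning, which is the default behaviour of the unsupervised Discretize filter. The function name and the calorie values are assumptions made for the example, and the handling of values lying exactly on a bin boundary may differ from WEKA's.

```python
def equal_width_bins(values, n_bins):
    """Map each numeric value to a bin index in 0..n_bins-1 (equal-width intervals)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0] * len(values)  # constant attribute: everything in one bin
    width = (hi - lo) / n_bins
    # The maximum value would fall just past the last bin, so clamp it.
    return [min(int((v - lo) / width), n_bins - 1) for v in values]


# Hypothetical numeric attribute (e.g. calories of a few foods).
calories = [2, 80, 100, 250, 400, 520]
for n_bins in (2, 3, 5):
    print(n_bins, "bins:", equal_width_bins(calories, n_bins))
```

With few bins, values that are far apart end up with the same symbol; with many bins, some bins contain very few instances or none at all.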