Practical Session: introduction to WEKA (3-4h)
This lab session uses WEKA, a machine learning tool implemented in Java.
WEKA implements many of the classification, regression and clustering
algorithms that you have seen during the Machine Learning ("Apprentissage
Automatique") and Data Mining lectures. WEKA is free to download for Linux, Mac and Windows.
For each exercise, write a short report for yourself (4 or 5 lines) that records your results and
your analysis of them.
The original format and the description of most of the datasets
that are used during those lab sessions can be found on
the UCI web page.
You can use Wikipedia to refresh your memory of
some learning algorithms or of concepts such
as cross-validation, inductive
bias, etc.
Exercise 1: Manipulation of WEKA data
In this exercise you will learn part of WEKA's graphical interface (GUI), the Explorer (click on the "Explorer" button to start it). You can also run algorithms on your data directly from the command line, or use the GUI to build the list of parameters (for example for the J48 class) and then reuse those parameters on the command line. In the main WEKA interface, click on the "Simple CLI" button to start the command-line interface.
You can find information about the ARFF data format here.
For all other information, including the use of the Explorer or of the command-line interface, please refer to the Weka Manual.
- (Q) Choose the kind of preference you would like to explain,
e.g.: actress/actors, food, cars, music bands,
or anything you want.
- (Q) Create a data file in arff format containing about 20 entries, each
described by about 4 attributes and the
last attribute containing your preference, e.g.
@relation food
@attribute calories numeric
@attribute taste {sweet, sour, bitter, salty}
@attribute vegetarian {yes, no}
@attribute like_it {yes, no}
@data
100, sweet, yes, yes % icecream
80, bitter, yes, yes % beer
2, sweet, yes, no % tic-tac
- (Q) Using the Explorer (GUI), compare 3 classification algorithms on your data: decision trees (e.g.
weka.classifiers.trees.J48), a rule learner (e.g. weka.classifiers.rules.PART),
and naive Bayes (weka.classifiers.bayes.NaiveBayes).
To do this, open the "Classify" tab and choose, for example, classifiers->trees->J48
(an implementation of the C4.5 decision tree algorithm). For each algorithm, note the percentage of "Correctly Classified Instances" under "Stratified cross-validation"
(which algorithm explains your personality best?), and look at the generated
rules (do they tell you anything interesting?).
Exercise 2: Decision Tree Learning
We would like to train a computer to learn the concept of different animal genera (mammals, fish, insects and so on).
Our input is file zoo.arff.
- Open file zoo.arff in WEKA.
- (Q) Remove what you think are unnecessary attributes by choosing them and pressing
remove (how many attributes did you remove?).
- Save the dataset with the removed attributes (call it zoo-removed.arff).
- Select the J48 classifier in the Explorer's "Classify" tab to classify your data. We will use 66% of
the animals to train the computer and 34% to evaluate the
classifier. To do this, choose the "Percentage split" option with 66%.
- (Q) Build a decision tree by clicking on the "Start" button. The result
appears in the output panel and as a new line in the "history" list. To see the
tree, right-click on the line in the history list and choose
"Visualize tree". In the tree window, right-click and adjust the size
of the tree using the menu options. In the output panel, below the textual tree, you will find the estimate of the tree's predictive performance.
- (Q) The output reports the correctly classified instances (%), true positives, true negatives and false positives. What do they refer to? What are they used for?
- (Q) Repeat the experiment with the "Use training set" option. Because the evaluation
is performed on the training set itself, it is highly optimistic and
represents an upper bound on the performance you can achieve with
this model.
- (Q) Repeat the experiment with 10-fold cross-validation (the set is
divided into 10 parts: 9 parts are used for training and 1 for
testing; the process is repeated 10 times, once per part, and the results are averaged) and with 5-fold cross-validation.
From the results, draw a conclusion about the ability of the J48
classifier to extract the concept from a given dataset. What is the
predictive performance of the model (on a good-bad scale), and
does the performance depend on the random selection of the
subset of the training data?
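The fold mechanism described in the last question can be sketched in a few lines of Python. This is an illustration under assumptions, not WEKA's implementation: the folds here are plain random folds, whereas WEKA also stratifies them by class, and the names k_fold_indices and cross_validate are hypothetical.

```python
import random


def k_fold_indices(n_instances, k, seed=0):
    """Yield (train, test) index lists for k-fold cross-validation.

    Assumption: plain (non-stratified) folds built from one shuffle.
    """
    indices = list(range(n_instances))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]   # k nearly equal parts
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


def cross_validate(n_instances, k, evaluate, seed=0):
    """Average evaluate(train, test) over the k folds."""
    scores = [evaluate(train, test)
              for train, test in k_fold_indices(n_instances, k, seed)]
    return sum(scores) / len(scores)


# With 20 instances and 5 folds, each round trains on 16 and tests on 4.
for train, test in k_fold_indices(20, 5):
    print(len(train), len(test))
```

Changing the seed changes which instances fall into which fold, which is one way to see whether a result depends on the random selection of the training subset.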
Predicting new instances with the decision tree
While the GUI version of WEKA is convenient for visualizing results and
setting parameters through forms, when it comes to building a
classification (or prediction) model and then applying it to new
instances, the most direct and flexible approach is to use the command
line. Note that command lines can be called from any home-made software (in particular if you need WEKA in your professional life).
Let zoo_test.arff
be our new test set. Notice (by looking at the dataset) that this
time, the value of the last attribute ("type") is unknown ("?"). Be careful to have the same attributes in the chosen training set and in your test set.
- The main command for generating the classification model as we did above is:
java weka.classifiers.trees.J48 -C 0.25 -M 2 -t directory-path/zoo-removed.arff -d directory-path/zoo.model
The options -C 0.25 and -M 2 in the above
command are the same options that we selected for the J48 classifier in
the previous GUI example. The -t option specifies that
the next string is the full path to the training file (in
this case "zoo-removed.arff"). In the above command, directory-path should be
replaced with the full path of the directory where the training file
resides. Finally, the -d option specifies the name (and location)
of the file where the model will be stored. After executing this command inside
the "Simple CLI" interface, you should see the tree and statistics about
the model in the top window.
- Based on the above command, our classification model has been
stored in the file "zoo.model" and placed in the directory we
specified. We can now apply this model to the new instances. The
advantage of building a model and storing it is that it can be
applied at any time to different sets of unclassified instances. The
command for doing so is:
java weka.classifiers.trees.J48 -p 17 -l directory-path/zoo.model -T directory-path/zoo_test.arff
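The option -p makes WEKA print one prediction per test instance for
the class attribute (by default the last attribute, here "type",
attribute number 17). Strictly speaking, the argument of -p is the
range of attributes to print next to each prediction (0 for none);
the class index itself is set with -c. The -l option specifies the
path and name of the model file (this is what was created
in the previous step). Finally, the -T option specifies the name
(and path) of the test data. In our example, the test data is our
new instances file "zoo_test.arff".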
This command results in a 5-column output similar to the following:
inst#  actual  predicted  error  prediction
0      1:?     1:mammal          0.75
1      1:?     2:bird     +      0.7272727272727273
2      1:?     1:mammal          0.95
3      1:?     3:reptile  +      0.8813559322033898
The first column is the instance number that WEKA assigns to the new instances
in "zoo_test.arff". The 2nd column is the actual "type" value in the test data (in this case, we
did not have a value for "type" in "zoo_test.arff", thus this value is
"?" and it is labelled with the majority class, 1:). The 3rd column is the predicted value of
the "type" attribute for the new instance. The 4th column flags with "+" the
instances whose predicted and actual values differ. Finally, the 5th
column is the confidence of the prediction for that instance.
- (Q) What output do you get? What does this confidence mean?
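When the command line is called from your own software, the prediction listing has to be read back by a program. The Python sketch below parses output of the form shown above. It assumes whitespace-separated columns, an optional "+" error flag and "index:label" cells; other WEKA versions may print a different layout, and the function name parse_predictions is hypothetical.

```python
def parse_predictions(output):
    """Parse prediction lines into a list of dictionaries.

    Assumption: each data line is 'inst# actual predicted [+] confidence'.
    """
    rows = []
    for line in output.splitlines():
        parts = line.split()
        if not parts or not parts[0].isdigit():
            continue   # skip the header and blank lines
        has_error_flag = "+" in parts
        if has_error_flag:
            parts.remove("+")
        if len(parts) != 4:
            continue   # not a line in the assumed layout
        inst, actual, predicted, confidence = parts
        rows.append({
            "instance": int(inst),
            "actual": actual.split(":", 1)[1],
            "predicted": predicted.split(":", 1)[1],
            "error_flag": has_error_flag,
            "confidence": float(confidence),
        })
    return rows


sample = """inst# actual predicted error prediction
0 1:? 1:mammal 0.75
1 1:? 2:bird + 0.7272727272727273
2 1:? 1:mammal 0.95
3 1:? 3:reptile + 0.8813559322033898
"""
for row in parse_predictions(sample):
    print(row["instance"], row["predicted"], row["confidence"])
```

Each dictionary can then be filtered or sorted, for example to list only the predictions whose confidence falls below a threshold of your choice.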
Exercise 3: Choice of learning algorithm
We will have to solve two classification problems, with the following training sets:
The datasets are already in WEKA's ARFF format.
There are no testing sets, so to assess performance, please use 10-fold cross-validation.
You will compare performance of two sub-symbolic classifiers:
- Nearest Neighbour - to classify a new instance, it looks for the most
similar instance in the training set (the one at the shortest Euclidean distance),
and assigns the class of this closest instance. This classifier is
implemented in Weka in
- Perceptron - this classifier simply finds a line (note: line, not curve)
in two-dimensional feature space which most accurately divides the instances of
the training set into two classes. During classification, it simply checks on
which side of the line a new instance lies. To generate this "Perceptron"
classifier in Weka, we will use a "degenerate" neural network with no hidden layers:
weka.classifiers.functions.MultilayerPerceptron -H 0
- (Q) First, find the accuracy of the above classifiers on the above
data sets (percentage of "Correctly Classified Instances" in
"Stratified cross-validation"). Can you say that one of the above
algorithms is in general better than the other?
- (Q) To understand and comment on the results, you can visualise the two data sets
in Weka Explorer.
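The two classifiers described above can be sketched in Python for two-dimensional data. This is only an illustration under assumptions: labels are +1 / -1, the perceptron uses the classic mistake-driven update rule (WEKA's MultilayerPerceptron is trained differently), and the toy data and function names are made up for the example.

```python
import math


def nearest_neighbour(train, query):
    """Return the label of the training point closest to query.

    train is a list of ((x, y), label) pairs; distance is Euclidean.
    """
    _, label = min(train, key=lambda item: math.dist(item[0], query))
    return label


def train_perceptron(train, epochs=100, rate=0.1):
    """Learn (w1, w2, b) so that the line w1*x + w2*y + b = 0 separates the classes.

    Assumption: labels are +1 / -1 and the classic perceptron rule is used.
    """
    w1 = w2 = b = 0.0
    for _ in range(epochs):
        mistakes = 0
        for (x, y), label in train:
            predicted = 1 if w1 * x + w2 * y + b > 0 else -1
            if predicted != label:
                # Move the line towards the misclassified point.
                w1 += rate * label * x
                w2 += rate * label * y
                b += rate * label
                mistakes += 1
        if mistakes == 0:
            break   # every training point is on the correct side
    return w1, w2, b


def perceptron_predict(weights, point):
    """Check on which side of the learned line the point lies."""
    w1, w2, b = weights
    return 1 if w1 * point[0] + w2 * point[1] + b > 0 else -1


# Hypothetical, linearly separable toy data.
toy = [((0, 0), -1), ((0, 1), -1), ((3, 3), 1), ((4, 3), 1)]
weights = train_perceptron(toy)
print(nearest_neighbour(toy, (3.5, 3)), perceptron_predict(weights, (5, 5)))
```

On data that no straight line can separate, the perceptron loop stops after the given number of epochs with some training points still misclassified, while the nearest-neighbour rule is not restricted to a linear boundary.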
Using discretization
The KNN and perceptron classifiers are designed to handle numerical attributes.
- (Q) Try to discretize the data beforehand using the supervised discretization filter (in
this case, the discretization uses an entropy-based method)
or weka.filters.unsupervised.attribute.Discretize with
different numbers of bins in the "Preprocess" panel.
For both commands, the parameter -R col1,col2-col4,... can be used
to specify the list of columns to discretize: "first" and "last" are
valid indexes (and the default ones for unsupervised
discretization). A very interesting paper by [Dougherty
et al., 1995], which surveys the different discretization methods, can be
found HERE.
NB: the supervised discretization returns
"all" if it cannot find a meaningful split of the attribute with
respect to the target class.
- (Q) Use also a "symbolic" classifier (for example J48) on your
discretized data.
- (Q) What is the influence of the number of bins for unsupervised
discretization? (Don't forget to reload the original (numeric)
relation or "undo" the discretization before applying another one.)
- (Q) Does a symbolic classifier perform better than the numeric ones
after discretization? What can you conclude?
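To get a feeling for what the number of bins does, here is a minimal Python sketch of equal-width binning, one common unsupervised scheme. It is an assumption that this matches the setting you use in WEKA; the exact interval conventions of the WEKA filter are not reproduced, and the function name equal_width_bins is hypothetical.

```python
def equal_width_bins(values, n_bins):
    """Map each numeric value to a bin index between 0 and n_bins - 1.

    Assumption: equal-width bins over [min, max]; the maximum value
    is placed in the last bin.
    """
    low, high = min(values), max(values)
    if high == low:
        return [0 for _ in values]   # a single bin when all values are equal
    width = (high - low) / n_bins
    return [min(int((v - low) / width), n_bins - 1) for v in values]


values = [0.0, 2.5, 5.0, 7.5, 10.0]
print(equal_width_bins(values, 2))   # [0, 0, 1, 1, 1]
print(equal_width_bins(values, 4))   # [0, 1, 2, 3, 3]
```

With few bins, values that are far apart end up with the same symbol; with many bins, almost every value gets its own symbol. Both extremes change what a symbolic classifier can learn from the attribute.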