The goal of this project is to automatically recognize hand-drawn digits and/or letters using different machine learning algorithms. You will work on data acquisition and storage, on the implementation and comparison of several machine learning algorithms (Naive Bayes, KNN, Decision Trees (or Random Forests), Neural Networks), and on the evaluation of both the parameters (e.g. the weights of the edit distance, the number of nearest neighbors for KNN, etc.) and the system itself (cross-validation, test set, etc.). A nice Graphical User Interface (GUI) is a plus, as it allows a live demo of the system at the end of the project. You can choose any programming language.
You have to create your own dataset and a structure to store the data.
You may decide to create a real database system (using, for example, PostgreSQL) and store the images containing the digits. This way, you can use and compare multiple representations of each image.
You may also decide to store only the Freeman code, either as plain text or as XML files, as shown in Fig. 3.
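As an illustration, here is a minimal sketch of such a storage layer. It uses SQLite from the Python standard library rather than PostgreSQL so that it runs out of the box, and the table and column names are illustrative assumptions, not a prescribed schema:

```python
import sqlite3

# Illustrative schema: each sample keeps the raw image (as a PNG blob)
# alongside its Freeman code, so several representations can be compared.
conn = sqlite3.connect("digits.db")
conn.execute("""
    CREATE TABLE IF NOT EXISTS samples (
        id           INTEGER PRIMARY KEY,
        label        TEXT NOT NULL,      -- the digit/letter that was drawn
        writer       TEXT NOT NULL,      -- who drew it (to track diversity)
        image_png    BLOB,               -- raw image of the drawing frame
        freeman_code TEXT                -- chain-code string, e.g. "2460"
    )
""")
conn.commit()

def add_sample(label, writer, image_png, freeman_code):
    """Insert one drawn digit/letter with both of its representations."""
    conn.execute(
        "INSERT INTO samples (label, writer, image_png, freeman_code) "
        "VALUES (?, ?, ?, ?)",
        (label, writer, image_png, freeman_code),
    )
    conn.commit()
```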
Whatever the structure, you will find out early in your experimental process that the dataset should be populated by more than one person, to make sure that the instances of each class are diverse enough (in terms of writing style) to reach a sufficiently good classification accuracy.
If you focus only on digits (and not also on letters), you may also reuse the Digits dataset from the UCI repository that you have already used in your Scikit-learn practical session, but this should come as a complement to your own home-made dataset.
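That dataset ships with scikit-learn, so loading it takes only a few lines:

```python
from sklearn.datasets import load_digits

digits = load_digits()      # the UCI Digits dataset bundled with scikit-learn
print(digits.images.shape)  # (1797, 8, 8) grayscale images
print(digits.target[:10])   # the corresponding labels
```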
There are multiple ways to encode a digit/character drawn on a screen for further processing. One of them consists of saving the digit and the frame in which it has been drawn as an image. From this image, a successful string representation proposed by Freeman [1] can be used to encode the digit.
To compute the Freeman code of an image, one moves along a digital curve or a sequence of border pixels based on eight-connectivity. The direction of each movement is encoded using a numbering scheme {i, i = 0, 1, ..., 7}, where i denotes an angle of 45°·i (counter-clockwise) from the positive y-axis. The chain code can be viewed as a connected sequence of straight-line segments with specified lengths and directions, as shown in Figure 1.
You may first need to compute the pixel hull (edges) of each drawn digit. Then, to create the code, you can start at the top left corner of your image and scan each pixel from left to right and top to bottom until you find a pixel which is not white. From there, you follow the hull to build the code until you are back at your starting point. You can find here the pseudo-code of the algorithm and here the corresponding Python code.
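In the same spirit, here is a minimal sketch of such a boundary-following encoder. It assumes a binary NumPy image (0 = white, nonzero = ink), uses a simple Moore-neighbor trace, and its direction table is one possible reading of the 45°·i convention, so adapt it to whatever convention you settle on:

```python
import numpy as np

# 8-neighbor offsets (row, col) indexed by chain code, listed clockwise
# starting from "up" in image coordinates -- one possible convention.
OFFSETS = [(-1, 0), (-1, 1), (0, 1), (1, 1), (1, 0), (1, -1), (0, -1), (-1, -1)]

def freeman_code(img):
    """Trace the hull of the first ink blob found and return its Freeman
    chain code as a string of digits 0-7 (Moore-neighbor tracing sketch)."""
    rows, cols = img.shape
    start = None
    # Scan left to right, top to bottom for the first non-white pixel.
    for r in range(rows):
        for c in range(cols):
            if img[r, c]:
                start = (r, c)
                break
        if start is not None:
            break
    if start is None:
        return ""
    code = []
    current = start
    search = 0  # every pixel above/left of the start is known background
    while True:
        for i in range(8):
            d = (search + i) % 8
            nr, nc = current[0] + OFFSETS[d][0], current[1] + OFFSETS[d][1]
            if 0 <= nr < rows and 0 <= nc < cols and img[nr, nc]:
                code.append(d)
                current = (nr, nc)
                # Resume the scan just clockwise of the pixel we came from,
                # so the trace keeps hugging the hull.
                search = (d + 5) % 8
                break
        else:
            break  # isolated pixel, nothing to follow
        if current == start:  # simple stop; Jacob's criterion is stricter
            break
    return "".join(map(str, code))
```

For instance, on a filled 2×2 blob this returns "2460": one step right, down, left and up, back to the start.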
Algorithm 1 recalls how to classify an unknown example x using a dataset S with the KNN algorithm. This algorithm has also been seen during the lecture. You are expected to implement it from scratch yourself to understand the need for a good data representation, a good distance, and an efficient data structure.
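As a starting point, here is a minimal from-scratch sketch of such a classifier. The function name and the (example, label) pair layout are illustrative assumptions, not necessarily those of Algorithm 1, and the distance is passed in so any of the distances discussed below can be plugged in:

```python
from collections import Counter

def knn_classify(x, dataset, k, dist):
    """Classify x by majority vote among its k nearest neighbors.

    dataset: list of (example, label) pairs; dist: distance function.
    """
    neighbors = sorted(dataset, key=lambda pair: dist(x, pair[0]))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]
```

Note that this brute-force version scans and sorts the whole dataset for every query; the optimizations discussed later exist precisely to avoid that cost.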
Many different distances can be used to compare two Freeman codes in the database. You may want to test the performance of your KNN algorithm using distances such as the L1, L2 or Mahalanobis distances on other types of image representations (quad tree, 0-1 pixel matrix, interest point detectors, ...). However, the most natural distance between strings is probably the edit distance.
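For the 0-1 pixel matrix representation, for instance, the first two amount to a few lines of NumPy (a sketch; the representation as a NumPy array is an assumption):

```python
import numpy as np

def l1(u, v):
    """L1 (Manhattan) distance between two 0-1 pixel matrices."""
    d = u.astype(float) - v.astype(float)  # cast avoids unsigned underflow
    return np.abs(d).sum()

def l2(u, v):
    """L2 (Euclidean) distance between the same representations."""
    d = u.astype(float) - v.astype(float)
    return np.sqrt((d * d).sum())
```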
Rather than counting the number of edit operations, one can assign a cost to each edit operation and search for the least costly transformation. The edit distance between two strings, given the cost of each edit operation, can be computed efficiently using the following dynamic programming algorithm. You can find examples of how this is computed here.
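A minimal sketch of that dynamic program is given below. The uniform unit costs are illustrative defaults: the per-operation costs are exactly the weights you are asked to tune (e.g. making substitutions between nearby Freeman directions cheaper than between opposite ones):

```python
def edit_distance(a, b, ins=1.0, dele=1.0, sub=1.0):
    """Weighted edit distance between strings a and b, by dynamic
    programming over D[i][j] = cost of turning a[:i] into b[:j]."""
    n, m = len(a), len(b)
    D = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        D[i][0] = D[i - 1][0] + dele          # delete all of a[:i]
    for j in range(1, m + 1):
        D[0][j] = D[0][j - 1] + ins           # insert all of b[:j]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0.0 if a[i - 1] == b[j - 1] else sub
            D[i][j] = min(D[i - 1][j] + dele,       # delete a[i-1]
                          D[i][j - 1] + ins,        # insert b[j-1]
                          D[i - 1][j - 1] + cost)   # substitute or match
    return D[n][m]
```

Passed to the KNN sketch above as dist=edit_distance, this yields a string-based classifier over Freeman codes.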
To be able to identify the digit online (with a real-time constraint), you may have to optimize your original algorithm (e.g. by reducing the size of the database to reduce the computations). Two algorithms are proposed to speed up the classification process: one (see Algo. 2) which removes from the original dataset S the outliers and the examples located in the Bayesian error region, thus smoothing the decision boundaries, and one (see Algo. 3) which deletes the irrelevant examples (e.g. the centers of the homogeneous clusters). Another optimization speeds up the search for the nearest neighbor itself by exploiting the triangle inequality property of the distance function (see [2] for more details on these techniques).
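One common way to realize the first idea is Wilson-style editing, sketched below under the assumption that it matches the spirit of Algo. 2 (it reuses knn_classify from the sketch above): every example that disagrees with the vote of its k nearest neighbors in the rest of S is treated as an outlier and dropped.

```python
def wilson_edit(dataset, k, dist):
    """Drop every (x, y) misclassified by a KNN vote over the rest of the
    dataset -- a sketch of Wilson editing; Algo. 2 may differ in details."""
    kept = []
    for i, (x, y) in enumerate(dataset):
        rest = dataset[:i] + dataset[i + 1:]  # leave-one-out neighborhood
        if knn_classify(x, rest, k, dist) == y:
            kept.append((x, y))
    return kept
```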