Word Segmentation in Handwritten Documents Using Genetic Programming
Word segmentation in handwritten documents is a difficult task because the intra-word spacing (i.e. the space between parts of the same word) is sometimes wider than the inter-word spacing (i.e. the space between two consecutive words). Many different approaches to segmenting words have been proposed so far. However, these approaches usually rely on parameters that are tuned manually, meaning that they do not take the properties of the document into account in order to calibrate the parameters automatically.
In this project, we wish to explore the use of genetic programming to find relations between the characteristics of the text and the parameters of the word segmentation algorithm. A good starting point is the algorithm proposed by Manmatha and Rothfeder, which is a state-of-the-art word segmentation algorithm. This algorithm is based on scale-space theory, a framework for representing image structures at different scales. The scale-space is obtained by Gaussian filtering. Roughly speaking, if we convolve the image with Gaussian kernels of different sizes (i.e. standard deviations), we obtain the image structures at different scales. For a text line, Gaussian kernels of a suitable size yield blobs that correspond to words. In the original paper, Manmatha and Rothfeder use an experimental formula to tune the size of the Gaussian kernels. However, their formula is independent of the characteristics of the text line, such as how densely or how sparsely the characters are written. As a result, the algorithm sometimes produces under-segmentation errors (several words merged into one blob) and over-segmentation errors (one word split into several blobs). To mitigate this problem, we wish to use genetic programming to estimate the optimal size of the Gaussian kernels based on the properties of the text.
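The effect of the kernel size can be illustrated with a minimal C++ sketch. It is a deliberately simplified one-dimensional stand-in, not the Manmatha and Rothfeder algorithm: it assumes the text line has already been reduced to a vertical projection profile (one ink value per column), smooths that profile with a Gaussian of standard deviation sigma, and counts the runs of columns above a threshold as blobs. The function names gaussianKernel, smooth and countBlobs, the truncation at three standard deviations, and the threshold are all assumptions made for illustration.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Build a normalized 1D Gaussian kernel, truncated at 3 standard deviations.
std::vector<double> gaussianKernel(double sigma) {
    assert(sigma > 0.0);
    const int radius = static_cast<int>(std::ceil(3.0 * sigma));
    std::vector<double> kernel(2 * radius + 1);
    double sum = 0.0;
    for (int i = -radius; i <= radius; ++i) {
        const double v = std::exp(-(i * i) / (2.0 * sigma * sigma));
        kernel[i + radius] = v;
        sum += v;
    }
    for (double& v : kernel) v /= sum;  // weights sum to 1
    return kernel;
}

// Convolve the profile with the Gaussian; columns outside the line count as 0.
std::vector<double> smooth(const std::vector<double>& profile, double sigma) {
    const std::vector<double> kernel = gaussianKernel(sigma);
    const int radius = static_cast<int>(kernel.size() / 2);
    const int n = static_cast<int>(profile.size());
    std::vector<double> out(profile.size(), 0.0);
    for (int x = 0; x < n; ++x) {
        for (int k = -radius; k <= radius; ++k) {
            const int j = x + k;
            if (j >= 0 && j < n) out[x] += kernel[k + radius] * profile[j];
        }
    }
    return out;
}

// Count maximal runs of columns whose smoothed value exceeds the threshold.
// Each run is treated as one blob, i.e. one candidate word.
int countBlobs(const std::vector<double>& profile, double sigma,
               double threshold) {
    const std::vector<double> smoothed = smooth(profile, sigma);
    int blobs = 0;
    bool inside = false;
    for (double v : smoothed) {
        const bool on = v > threshold;
        if (on && !inside) ++blobs;  // a new run starts here
        inside = on;
    }
    return blobs;
}
```

As a toy case, take a profile of two words, each made of two 4-column strokes separated by a 2-column gap, with a 10-column gap between the words, and a threshold of 0.25. With sigma = 0.5 the sketch reports four blobs (over-segmentation), with sigma = 2 it reports two blobs (the intended words), and with sigma = 8 it reports a single blob (under-segmentation). The appropriate sigma thus depends on the spacing of the writing, which is the kind of relation the genetic programming step is meant to discover.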
The student will be provided with C/C++ source code for working with a benchmark database, which will be used to train the algorithms and evaluate their performance.