
Connectionist Temporal Classification

After reading the paper by Farès Menasri et al. [1], “The A2iA French handwriting recognition system at the Rimes-ICDAR2011 competition”, we looked into Connectionist Temporal Classification (CTC), a really interesting method for labelling unsegmented sequences directly.

In their article [2], Alex Graves, Santiago Fernández, Faustino Gomez, and Jürgen Schmidhuber present the method as follows:

Many real-world sequence learning tasks require the prediction of sequences of labels from noisy, unsegmented input data. In speech recognition, for example, an acoustic signal is transcribed into words or sub-word units. Recurrent neural networks (RNNs) are powerful sequence learners that would seem well suited to such tasks. However, because they require pre-segmented training data, and post-processing to transform their outputs into label sequences, their applicability has so far been limited. This paper presents a novel method for training RNNs to label unsegmented sequences directly, thereby solving both problems. 

Keras is a model-level library, providing high-level building blocks for developing deep learning models.

Keras implements the Connectionist Temporal Classification algorithm through ctc_batch_cost, ctc_decode and ctc_label_dense_to_sparse for the TensorFlow, Theano and CNTK backends.
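As a minimal sketch (the shapes and dummy values below are our own assumptions, not taken from a library example), these backend helpers are called roughly as follows:

import numpy as np
from keras import backend as K

batch, timesteps, num_classes = 2, 50, 102   # e.g. 101 characters + the CTC "blank"

# softmax probabilities a model would produce, here random placeholders
probs = np.random.rand(batch, timesteps, num_classes).astype('float32')
probs /= probs.sum(axis=-1, keepdims=True)
y_pred = K.constant(probs)

# target label sequences (indices into the alphabet) and their lengths
labels = K.constant(np.array([[5, 12, 3, 9], [7, 7, 1, 2]], dtype='float32'))
label_length = K.constant(np.array([[4], [4]], dtype='int32'))
input_length = K.constant(np.full((batch, 1), timesteps, dtype='int32'))

# per-sample CTC loss, shape (batch, 1)
loss = K.ctc_batch_cost(labels, y_pred, input_length, label_length)

# greedy (best-path) decoding back into label index sequences
decoded, log_prob = K.ctc_decode(y_pred, input_length=K.flatten(input_length),
                                 greedy=True)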

Moreover, Keras provides the Python program image_ocr.py, a really good example which trains an RNN (GRU) model on generated images of different sizes containing English words.

 

Taizz Yemen 2000

Stained-glass window craftsman, Taizz, Yemen, 2000

The training dataset

We have adapted the program to train a larger model that can deal with the French alphabet and script fonts.

There are 101 characters taken into account (a small alphabet-building sketch follows the list):

  • the lowercase letters,
  • the uppercase letters,
  • the c cedilla (ç),
  • the apostrophe,
  • accented vowels (àâæéèëêîïôöœùûüÿ),
  • accented capital letters (ÀÂÆÇÉÈÊËÎÏÔŒÙÛÜŸ),
  • some accented consonants needed for certain first names (e.g. ñ),
  • some punctuation and special characters,
  • and the indispensable ‘space’ character.
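As an illustration, such an alphabet can be assembled in Python as sketched below. The exact character set is our assumption (the punctuation in particular), and the CTC ‘blank’ label is handled by the loss itself, so only the real characters appear here:

import string

alphabet = (
    string.ascii_lowercase                  # the lowercase letters
    + string.ascii_uppercase                # the uppercase letters
    + "ç'"                                  # the c cedilla and the apostrophe
    + "àâæéèëêîïôöœùûüÿ"                    # accented vowels
    + "ÀÂÆÇÉÈÊËÎÏÔŒÙÛÜŸ"                    # accented capital letters
    + "ñ"                                   # accented consonants for first names
    + "-.,()"                               # some punctuation and special characters
    + " "                                   # the indispensable space
)

# mappings between label indices and characters, used to encode the words
char_of = dict(enumerate(alphabet))
index_of = {c: i for i, c in char_of.items()}
print(len(alphabet))   # should approach the 101 characters listed above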

The word lists have been replaced with French words, surnames and first names. They contain 123,000 items for the first learning phase and 105,000 pairs for the second phase.

The first list is composed of the 1,750 most frequently used words of the French language, taken from the CNRTL lexicon [ref. 1], together with 29,999 surnames and 4,320 first names.

The second training dataset is made up of surnames and first names: pairs are generated randomly, consisting of a surname and a first name, then two female first names, and finally two male first names.

With these two word lists, the program generates 64×128 and 64×512 pixel images at each stage of learning, fed to the Keras function fit_generator through the TextImageGenerator generator it provides.
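For reference, the data contract such a generator has to honour looks roughly like the sketch below. This is not TextImageGenerator itself (which also selects the words and renders them with the script fonts); the array shapes and batch size are assumptions:

import numpy as np

def dummy_batch_generator(batch_size=32, img_w=512, img_h=64,
                          max_string_len=24, num_timesteps=64):
    """Yields batches in the (inputs, outputs) form consumed by fit_generator."""
    while True:
        inputs = {
            # grayscale images, channels last
            'the_input': np.zeros((batch_size, img_w, img_h, 1), dtype='float32'),
            # encoded label sequences, padded to max_string_len
            'the_labels': np.zeros((batch_size, max_string_len), dtype='float32'),
            # number of RNN timesteps available to CTC for each sample
            'input_length': np.full((batch_size, 1), num_timesteps, dtype='int64'),
            # actual length of each label sequence
            'label_length': np.full((batch_size, 1), max_string_len, dtype='int64'),
        }
        # the CTC loss is computed inside the model, so the "target" is a dummy
        outputs = {'ctc': np.zeros((batch_size,), dtype='float32')}
        yield inputs, outputs

Training then simply chains such a generator to fit_generator, which is what the example does with its real TextImageGenerator.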

The model

As the example provided uses only the 26 lowercase letters of the English alphabet, its model is deep and large enough to get fair results.

In our case, three criteria led us to deepen and widen the model: the greater number of symbols, the length of the words to be supported and the script fonts used.

The summary and the graph of the model are shown below.

RNN_GRU model summary

Figure 1 – RNN_GRU model summary

 

CTC RNN GRU

Figure 2 – CTC Recurrent Neural Network with GRU

 

We have added one conv2d layer and changed the number of filters for each conv2d layer. We have also widened the first GRU layer.
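For readers who have not opened image_ocr.py, the general shape of such a convolution + GRU + CTC model in Keras is sketched below. The layer sizes, number of filters and optimizer are illustrative assumptions, not our exact configuration, which is the one shown in Figure 1:

from keras import backend as K
from keras.layers import (Input, Conv2D, MaxPooling2D, Reshape, Dense, GRU,
                          Lambda, add, concatenate)
from keras.models import Model

img_w, img_h = 512, 64        # width x height of the generated images
num_classes = 101 + 1         # 101 characters + the CTC "blank" label
rnn_size = 512                # widened first GRU layer (illustrative value)
max_string_len = 24           # maximum label length (illustrative value)

def ctc_lambda(args):
    y_pred, labels, input_length, label_length = args
    # the first RNN outputs are typically discarded, as in the Keras example
    y_pred = y_pred[:, 2:, :]
    return K.ctc_batch_cost(labels, y_pred, input_length, label_length)

input_data = Input(name='the_input', shape=(img_w, img_h, 1))
x = Conv2D(32, (3, 3), padding='same', activation='relu')(input_data)
x = MaxPooling2D(pool_size=(2, 2))(x)
x = Conv2D(64, (3, 3), padding='same', activation='relu')(x)
x = MaxPooling2D(pool_size=(2, 2))(x)
x = Conv2D(128, (3, 3), padding='same', activation='relu')(x)   # added conv layer
x = MaxPooling2D(pool_size=(2, 2))(x)

# collapse the feature maps into a sequence along the width axis
x = Reshape(target_shape=(img_w // 8, (img_h // 8) * 128))(x)
x = Dense(64, activation='relu')(x)

# two bidirectional GRU layers built from forward/backward pairs
gru_1 = GRU(rnn_size, return_sequences=True)(x)
gru_1b = GRU(rnn_size, return_sequences=True, go_backwards=True)(x)
gru1_merged = add([gru_1, gru_1b])
gru_2 = GRU(rnn_size, return_sequences=True)(gru1_merged)
gru_2b = GRU(rnn_size, return_sequences=True, go_backwards=True)(gru1_merged)

# per-timestep character probabilities
y_pred = Dense(num_classes, activation='softmax')(concatenate([gru_2, gru_2b]))

labels = Input(name='the_labels', shape=[max_string_len], dtype='float32')
input_length = Input(name='input_length', shape=[1], dtype='int64')
label_length = Input(name='label_length', shape=[1], dtype='int64')
loss_out = Lambda(ctc_lambda, output_shape=(1,), name='ctc')(
    [y_pred, labels, input_length, label_length])

model = Model(inputs=[input_data, labels, input_length, label_length],
              outputs=loss_out)
# the CTC loss is computed inside the Lambda layer, so compile with a pass-through
model.compile(loss={'ctc': lambda y_true, y_pred: y_pred}, optimizer='adam')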

The training

We trained the model with the two phases already present in the example provided, then we increased the number of epochs for the second phase. Finally, we created a third phase with a larger number of words and an increased maximum word length.

Our experiments show breaks in the learning behavior of the model. The loss decreases steadily until a breaking point, not always at the same value, where it can rise by between 10 and 40 before quickly decreasing again.

The graph below shows the learning curve for the first stage [epochs 1 to 19].

CTC Learning curve

Figure 3 – CTC Learning curve stage one

The peaks observed around epochs 2 and 5 can be explained by the implementation of the training functions: the on_epoch_begin method of the TextImageGenerator class changes the binding of the paint function several times.

CTC Learning curve stage two

Figure 4 – CTC Learning curve stage two

During the second learning phase [epochs 20-40], the peak at epoch 21 is due to the generation of a new list of words. The second peak, on the other hand, is caused by a sudden stall of the neural network. Why? We have no explanation for this behavior.

If we carry on with the learning process, several stalls are observed. The peaks around epochs 20 and 40 are caused by a change in the training dataset. The peak around epoch 55 remains unexplained. However, the model stabilizes and the loss diminishes quickly.

CTC Learning curve all epochs

Figure 5 – CTC Learning curve all epochs

When the training ends, we get loss: 1.1731 and val_loss: 2.4340.

The program provides a visualization callback which generates the following samples.

CTC Samples epoch 19

Figure 6 – CTC Samples epoch 19

Above, nearly all samples are correctly predicted, but for the last one an extra ‘t’ is given. Here, context must be taken into account to try to remove ambiguities, using dictionaries and other qualifiers.

CTC Samples epoch 28

Figure 7 – CTC Samples epoch 28

Above, 50 percent of the samples are correctly predicted. Yet, given the complexity of the fonts, the model performs quite well. Errors could be removed by relying on a lexicon and ensemble methods.

One of the great difficulties of register page transcription is distinguishing between the exact spelling of given names or surnames and the spelling errors that are found quite frequently in certain periods.

Overall, the results obtained are convincing, even impressive.

Now, we have to study the decoding function and the performance of the model on images coming from register extracts like the one below.

Surname from register

Figure 8 – Surname from register.

Making predictions

With Keras, we have several functions to make predictions with the model that we trained. Since the image_ocr.py program uses an image generator with fit_generator, it makes sense to keep the same image generator and make predictions with the predict_generator function, at first to better understand the decoding of the output probabilities.
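Concretely, turning the per-timestep probabilities back into text amounts to a best-path decoding: take the most likely label at each timestep, merge the repetitions and drop the CTC blank. A self-contained sketch (with random probabilities standing in for the output of predict_generator) could look like this:

import itertools
import numpy as np

def decode_batch(probs, alphabet):
    """Greedy (best-path) decoding of per-timestep softmax probabilities."""
    results = []
    for sample in probs:
        best_path = np.argmax(sample, axis=1)          # most likely label per timestep
        collapsed = [k for k, _ in itertools.groupby(best_path)]  # merge repeats
        # drop the CTC blank (by convention the last index with the TF backend)
        text = ''.join(alphabet[k] for k in collapsed if k < len(alphabet))
        results.append(text)
    return results

# In practice probs would come from something like
#   probs = prediction_model.predict_generator(image_generator, steps=n)
# where prediction_model maps the input image to the softmax output y_pred.
# Random probabilities are used here just to make the sketch self-contained.
alphabet = "abcdefghijklmnopqrstuvwxyz '"
probs = np.random.rand(2, 50, len(alphabet) + 1)       # +1 for the CTC blank
probs /= probs.sum(axis=-1, keepdims=True)
print(decode_batch(probs, alphabet))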

So, on the basis of the training program, we wrote a program that makes predictions on generated images. Many small changes were made for our purposes.

We get results similar to those obtained during training, like the ones below.

CTC Samples predictions

Figure 9 – CTC Samples prediction with keras.predict_generator

Then, we made predictions for images prepared from words extracted from registers. The results were quite puzzling: at first, they did not make sense, and even after some corrections they proved disappointing.

While the model performs correctly on generated images, it cannot decode handwritten words from images with less contrast or larger characters, like the image shown in figure 8.
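For these tests, the word images extracted from the registers first have to be brought to the format the model was trained on. A possible preparation step is sketched below with Pillow; this is our assumption, not the exact preprocessing used by image_ocr.py:

import numpy as np
from PIL import Image

def prepare_image(path, img_w=512, img_h=64):
    """Load a word image, convert to grayscale, resize and normalize to [0, 1]."""
    img = Image.open(path).convert('L')            # grayscale
    img = img.resize((img_w, img_h), Image.BILINEAR)
    data = np.asarray(img, dtype='float32') / 255.0
    # the example feeds images as (width, height, 1), channels last
    data = data.T[:, :, np.newaxis]
    return data[np.newaxis, ...]                   # add the batch dimension

The resulting array can then be fed directly to the prediction model.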

It seemed to us that the background of the image strongly disrupted the model. We conducted an experiment with a series of images produced from the IAM database. Even if the predictions are not correct, we can note a clear improvement in the decoded sequences.

CTC Samples predictions

Figure 10 – CTC predictions with Samples from IAM Database (keras.predict)

These various experiments show both the potential of the method and its weaknesses, in particular the importance of the datasets used to train the models. It seems that abstraction is still out of reach for these neural networks.

We will continue our study using other databases such as IAM or Rimes, while developing tools for the segmentation of lines and words.

References

[1] – Farès Menasri, Jérôme Louradour, Anne-Laure Bianne-Bernard and Christopher Kermorvant, (2012) The A2iA French handwriting recognition system at the Rimes-ICDAR2011 competition.

[2] – Alex Graves, Santiago Fernández, Faustino Gomez, and Jürgen Schmidhuber, (2006) Connectionist Temporal Classification: Labelling Unsegmented Sequence Data with Recurrent Neural Networks.

Resources

[1] Centre National de Ressources Textuelles et Lexicales (CNRS – UMR ATILF) – France – http://www.cnrtl.fr

[2] IAM handwriting database – http://www.fki.inf.unibe.ch/databases/iam-handwriting-database

[3] Rimes handwritten database – http://www.a2ialab.com/doku.php?id=rimes_database:start
