AI with Convolutional Neural Network

When you enter a new area of knowledge, the journey can look like a fantastic story. With artificial intelligence, we have the impression of entering one of the tents described by J.K. Rowling in Harry Potter and the Goblet of Fire. From the outside, the domain does not reveal its full extent. You are then caught by the subject, and it is a real pleasure to discover the rooms of the building one after the other.

In artificial intelligence, we distinguish between weak and strong artificial intelligence. Wikipedia gives the following definition of these areas:

Weak artificial intelligence (weak AI), also known as narrow AI, is non-sentient artificial intelligence that is focused on one narrow task. Weak AI is defined in contrast to either strong AI (a machine with consciousness, sentience and mind) or artificial general intelligence (a machine with the ability to apply intelligence to any problem, rather than just one specific problem). All currently existing systems considered artificial intelligence of any sort are weak AI at most.

Our study falls into the field of weak artificial intelligence. We will use different algorithms to build a solution that is reliable and accurate while requiring only modest resources.

AlHudaydah harbour, Yemen, 2000

As Dr. Théodore Bluche tells us in his thesis ([1] Théodore Bluche, 2015):

In handwriting recognition, however, deep neural networks are limited to convolutional architectures, such as convolutional neural networks or MDLSTM-RNNs, with only a few weights and extracted features in bottom layers. Densely connected neural networks with more than one or two hidden layers are only found in smaller applications, such as the recognition of isolated characters ([2] Ciresan et al., 2010; Ciresan et al., 2012) or keyword spotting ([3] Thomas et al., 2013).

Here our goal is, initially, to recognize isolated characters with a neural network and to deepen our knowledge of the field.

Convolutional Neural Network

Dr. Jason Brownlee writes in his article about deep learning, “Crash Course in Convolutional Neural Networks for Machine Learning”:

Convolutional Neural Networks are a powerful artificial neural network technique. These networks preserve the spatial structure of the problem and were developed for object recognition tasks such as handwritten digit recognition. They are popular because people are achieving state-of-the-art results on difficult computer vision and natural language processing tasks.

Convolutional neural networks, also called convnets or CNNs, were introduced by Yann LeCun et al. [4] in 1989 for handwritten digit recognition, which is a classification problem. LeCun and his colleagues later built the MNIST database (Modified National Institute of Standards and Technology database), a large database of handwritten digits that is commonly used for training various image processing systems.

The MNIST database of handwritten digits, available from this page, has a training set of 60,000 examples, and a test set of 10,000 examples. It is a subset of a larger set available from NIST. The digits have been size-normalized and centered in a fixed-size image.

The description of a convolutional neural network is given by [5] Matthew D. Zeiler and Rob Fergus as follows:

The model maps a color 2D input image xi, via a series of layers, to a probability vector yi over the C different classes.

Each layer consists of

  • (i) convolution of the previous layer output (or, in the case of the 1st layer, the input image) with a set of learned filters;
  • (ii) passing the responses through a rectified linear function (relu(x) = max(x, 0));
  • (iii) [optionally] max pooling over local neighborhoods;
  • (iv) [optionally] a local contrast operation that normalizes the responses across feature maps.
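
As a rough illustration of steps (i) to (iii), here is a minimal NumPy sketch of one layer applied to a single grayscale image with a single filter. The random image and filter, the 2×2 pooling window and the function names are assumptions made for the example. Step (iv) is left out, and a real network learns many filters per layer.

```python
import numpy as np


def conv2d_valid(image, kernel):
    """Slide one filter over a 2D image (no padding, stride 1)."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            # As in most deep learning libraries, the kernel is not flipped.
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out


def relu(x):
    """Rectified linear function: max(x, 0), element by element."""
    return np.maximum(x, 0)


def max_pool(x, size=2):
    """Keep the largest response in each non-overlapping size x size block."""
    h = (x.shape[0] // size) * size
    w = (x.shape[1] // size) * size
    blocks = x[:h, :w].reshape(h // size, size, w // size, size)
    return blocks.max(axis=(1, 3))


rng = np.random.default_rng(0)
image = rng.random((48, 48))          # stand-in for one grayscale character
kernel = rng.standard_normal((5, 5))  # stand-in for one learned filter

feature_map = max_pool(relu(conv2d_valid(image, kernel)))
print(feature_map.shape)  # (22, 22)
```

The 48×48 input shrinks to 44×44 after the 5×5 convolution, and the pooling step halves each side.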

The Architecture of the CNN used

A single GeForce GT 750M GPU has only 2048 MB of memory, which limits the maximum size of the networks that can be trained on it. It also limits the size of the validation and test sets with TensorFlow. The 8 GB of RAM on our laptop likewise limits the size of the training set to around 150,000 samples.

Our first convolutional neural network is a simple adaptation of the network used in studies done with the MNIST database. The differences are the size of the input, 2304 cells instead of 784, and the number of classes, 62 instead of 10.

The images have a resolution of 48×48 pixels with a depth of 8 bits, and we classify them into digits [0-9] and English letters [a-zA-Z].
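
The figures above can be checked with a few lines of Python. The order of the classes in the string below is an assumption, since the article only gives the set [0-9a-zA-Z].

```python
import string

IMAGE_SIDE = 48                        # pixels per side, from the text
INPUT_CELLS = IMAGE_SIDE * IMAGE_SIDE  # one input cell per pixel
MNIST_CELLS = 28 * 28                  # MNIST images are 28 x 28 pixels

# Class order is an assumption; only the set of characters is given.
CLASSES = string.digits + string.ascii_lowercase + string.ascii_uppercase

print(INPUT_CELLS, MNIST_CELLS, len(CLASSES))  # 2304 784 62
```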

We built the model using Keras with Tensorflow as back-end.

The picture below shows the whole network. This graph has to be read bottom-up.

Figure 1 – Convolutional Neural Network model

You will find a description of the different layers and components of a convolutional neural network model in Dr. Jason Brownlee’s article mentioned above.

To decrease the amount of memory needed, we add a pooling layer that reduces, here by a factor of 2, the size of the feature map passed to the next layer.

After two groups of convolutional and pooling layers, which extract the features of the image, we flatten the feature maps and pass them through fully connected layers that compute the probability of each class for the character in the image.

We trained this model several times for 30 epochs on a training set of 44,788 characters with a validation set of 700 characters, then tested the best weights on a test set of 11,198 characters.

The next figures show the curves of the accuracy and loss for both training and validation steps.

Figure 2 – Accuracy and Loss curves for training.

Figure 3 – Accuracy and Loss curves for validation

We can see on those curves that the learning process is fast, maybe too fast. The validation score drops at the third epoch, and the accuracy reaches 99.46% at the tenth epoch. The validation set is, in our opinion, not big enough. Unfortunately, we cannot increase it for lack of resources. Nevertheless, the results suggest some fairly interesting ways to get around this.

One of them is to combine several so-called weak models through aggregation and voting mechanisms.
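
To make the voting idea concrete, here is a minimal sketch of hard majority voting over the predictions of several models. The three prediction lists are invented for the example, and this is only one possible aggregation scheme, not necessarily the one we will end up using.

```python
from collections import Counter


def majority_vote(predictions):
    """Return, for each sample, the label predicted by most models.

    predictions: one list of labels per model, all of the same length.
    On a tie, the label seen first in model order wins (Python 3.7+).
    """
    voted = []
    for sample_votes in zip(*predictions):
        voted.append(Counter(sample_votes).most_common(1)[0][0])
    return voted


# Hypothetical outputs of three weak models on four characters.
model_a = ["a", "B", "7", "x"]
model_b = ["a", "8", "7", "x"]
model_c = ["o", "B", "1", "x"]

print(majority_vote([model_a, model_b, model_c]))  # ['a', 'B', '7', 'x']
```

Each model is wrong on one character, yet the vote recovers the label that two of the three agree on every time.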

TensorBoard provides the projector facility to display the distribution of our data, which makes the data much easier to analyze.

Figure 6 – Tensorboard data projection with t-SNE algorithm.

With the t-SNE algorithm, after more than 5,500 iterations, you get this representation of the distribution of your data after a simulation of the learning process.

Figure 7 – Tensorboard data projection with t-SNE algorithm after more than 5,000 iterations

Now that we have the framework, let’s move on to its implementation in Python 3.6 with Keras 2.0.6 and TensorFlow 1.4.

Model building

We start with a first module, which contains the different models that we tested. Only the second one is shown here.

This defines the 7-layer CNN model shown in Figure 1.

Line 11, we create the sequential model which is a linear stack of layers. The Keras documentation gives you all you need.

Lines 12 and 14, we define two 2D convolution layers to extract the features of the images provided. The first one has 32 filters and a 5×5 kernel; the other one has 64 filters and a 3×3 kernel. Both of them have a “relu” activation.

  • filters: Integer, the dimensionality of the output space (i.e. the number of output filters in the convolution).
  • kernel_size: An integer or tuple/list of 2 integers, specifying the width and height of the 2D convolution window. Can be a single integer to specify the same value for all spatial dimensions.
  • activation=’relu’: The ReLU function is f(x)=max(0,x). One way ReLUs improve neural networks is by speeding up training. The gradient computation is very simple (either 0 or 1 depending on the sign of x). Also, the computational step of a ReLU is easy: any negative elements are set to 0.0 — no exponentials, no multiplication or division operations.
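
The ReLU described in the last item, together with its gradient, fits in a few lines of NumPy. This is a sketch for illustration only. The choice of a zero gradient at exactly x = 0 is a common convention, stated here as an assumption.

```python
import numpy as np


def relu(x):
    """f(x) = max(0, x), applied element by element."""
    return np.maximum(0, x)


def relu_grad(x):
    """Derivative of ReLU: 1 where x > 0, else 0 (0 is chosen at x == 0)."""
    return (x > 0).astype(x.dtype)


x = np.array([-2.0, -0.5, 0.0, 1.5, 3.0])
print(relu(x).tolist())       # [0.0, 0.0, 0.0, 1.5, 3.0]
print(relu_grad(x).tolist())  # [0.0, 0.0, 0.0, 1.0, 1.0]
```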

You may find an interesting article about convnets here [intuitive-explanation-convnets].

Lines 13 and 15, we define two pooling layers, each of which reduces the dimensionality of the feature map. The first pooling layer lets the model fit in the memory available on our GPU.

Line 16, we flatten the input to fill the next dense layer.
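
To see what reaches the flatten step, the following sketch walks a 48×48 image through the two convolution and pooling groups. The kernel sizes and filter counts are the ones given above. The single input channel, the absence of padding, the stride of 1 and the 2×2 pooling windows are assumptions, so the numbers are indicative only.

```python
def conv_out(side, kernel):
    """Output side of a stride-1 convolution without padding."""
    return side - kernel + 1


def conv_params(kernel, in_channels, filters):
    """Weights plus one bias per filter."""
    return kernel * kernel * in_channels * filters + filters


side, channels = 48, 1                      # one grayscale channel (assumed)
total = 0
for filters, kernel in [(32, 5), (64, 3)]:  # values given in the text
    total += conv_params(kernel, channels, filters)
    side = conv_out(side, kernel) // 2      # 2 x 2 max pooling (assumed)
    channels = filters
    print(f"after conv {kernel}x{kernel} + pool: {side} x {side} x {channels}")

flat = side * side * channels
print("flattened vector:", flat)         # 6400
print("convolution parameters:", total)  # 19328
```

Under these assumptions, most of the weights sit in the dense layer that follows the flatten step rather than in the convolutions.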

Line 17, we use a fully connected layer, which is a traditional multilayer perceptron layer with a ‘relu’ activation function.

Line 18, we randomly drop half (50 percent) of the outputs of the previous dense layer, which helps to prevent overfitting.
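
The sketch below shows what such a dropout step does during training, using the usual "inverted dropout" formulation in plain NumPy. It is an illustration, not the Keras implementation. The function name and the array of ones standing in for the dense layer output are made up for the example.

```python
import numpy as np


def dropout(activations, rate, rng, training=True):
    """Inverted dropout: zero a fraction `rate` of the units while training and
    scale the survivors by 1 / (1 - rate) so the expected sum is unchanged."""
    if not training or rate == 0.0:
        return activations
    keep_mask = rng.random(activations.shape) >= rate
    return activations * keep_mask / (1.0 - rate)


rng = np.random.default_rng(42)
dense_output = np.ones((1, 1000))  # stand-in for the dense layer output
dropped = dropout(dense_output, rate=0.5, rng=rng)

print("fraction kept:", np.count_nonzero(dropped) / dropped.size)
print("surviving value:", dropped.max())  # 2.0
```

At prediction time (`training=False`) the layer lets everything through unchanged.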

Line 19, this is the last fully connected layer, with a softmax activation function that assigns a probability to each class.
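
For reference, softmax itself is short enough to write by hand. The sketch below uses three made-up scores instead of our 62 classes.

```python
import numpy as np


def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    shifted = logits - np.max(logits, axis=-1, keepdims=True)  # numerical stability
    exp = np.exp(shifted)
    return exp / np.sum(exp, axis=-1, keepdims=True)


scores = np.array([2.0, 1.0, 0.1])    # hypothetical scores for three classes
probabilities = softmax(scores)
print(probabilities.round(3))         # the largest score gets the largest probability
print(int(np.argmax(probabilities)))  # 0
```

The predicted character is simply the class with the highest probability.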

Loading data

The next module contains the functions that prepare and load the images and labels to feed our models.

The packages used in this module.

The function load_data to build the images and labels lists.

This function takes three parameters: input_folder, expand and train_test_ratio.

  • input_folder: where to find images to load,
  • expand: if we want to augment our images by skewing and shifting them,
  • train_test_ratio: the distribution between learning and validation data.

Line 68, we shuffle the array of filenames to improve the learning process and the overall accuracy of the model.

Lines 69-74, we build the array of labels, which are encoded in the file names. The way we generate those images is described in the previous article, “Handwriting recognition – Seven: Dataset generation”.

Lines 76-77, we call the function to expand our train and validation sets. This makes it possible to enrich the character set and substantially improve the accuracy of the model by about 15 to 20 percent.

Lines 79-84, we compute the offset between training set and validation set, split the array of images and the array of labels then return them.
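
The shuffle, label extraction and split steps can be sketched as follows. The "<label>_<index>.png" file naming, the function name `split_dataset` and the seed are hypothetical; the real load_data also reads the images and optionally expands them.

```python
import random


def split_dataset(filenames, train_test_ratio, seed=0):
    """Shuffle file names, read the label from each name, then split.

    Assumes a hypothetical "<label>_<index>.png" naming scheme.
    """
    names = list(filenames)
    random.Random(seed).shuffle(names)           # break up runs of the same class
    labels = [name.split("_")[0] for name in names]
    offset = int(len(names) * train_test_ratio)  # the first `offset` items train
    return (names[:offset], labels[:offset]), (names[offset:], labels[offset:])


files = [f"{label}_{i}.png" for label in "aB7" for i in range(10)]
(train_x, train_y), (val_x, val_y) = split_dataset(files, train_test_ratio=0.8)
print(len(train_x), len(val_x))  # 24 6
```

Shuffling before the split matters here: without it, the validation set would contain only the last class in the listing.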

The function expand_training_data creates new images with skew and shift. This function comes from hwalsuklee’s study on GitHub, with some adaptations. It still needs to be improved, as it doesn’t shuffle the new data.
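
As an illustration of the shifting half of this idea, here is a NumPy-only sketch that also shuffles the result. It is not hwalsuklee's code: the function names, the maximum shift of 2 pixels and the number of copies are assumptions, and skewing is left out because it needs an interpolation routine.

```python
import numpy as np


def shift_image(image, dx, dy, fill=0):
    """Return a copy of `image` moved dx pixels right and dy pixels down.

    Pixels pushed past the border are lost; uncovered pixels take `fill`.
    """
    h, w = image.shape
    pad = max(abs(dx), abs(dy))
    padded = np.pad(image, pad, mode="constant", constant_values=fill)
    return padded[pad - dy:pad - dy + h, pad - dx:pad - dx + w]


def expand_with_shifts(images, labels, max_shift=2, copies=2, seed=0):
    """Add `copies` randomly shifted versions of every image, then shuffle."""
    rng = np.random.default_rng(seed)
    out_images, out_labels = list(images), list(labels)
    for image, label in zip(images, labels):
        for _ in range(copies):
            dx, dy = rng.integers(-max_shift, max_shift + 1, size=2)
            out_images.append(shift_image(image, int(dx), int(dy)))
            out_labels.append(label)
    order = rng.permutation(len(out_images))  # mix originals and new samples
    return np.stack(out_images)[order], np.asarray(out_labels)[order]


glyph = np.zeros((48, 48), dtype=np.uint8)
glyph[20:28, 20:28] = 255                     # stand-in for a character
images, labels = expand_with_shifts([glyph], ["a"])
print(images.shape, labels.shape)  # (3, 48, 48) (3,)
```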

While writing this article, our research led us to a small nugget: the image preprocessing module of Keras. In the future, we will look at using this module, which offers transformations of this kind and more, such as zoom and shear.

Training the model

The third module is the main one.

The parameters are:

  • data_dir : the folder where images are stored,
  • logs : the folder where we will write the data produced by the script (sprite image, metadata, checkpoints, events log, …),
  • models : the folder where we will save our model as a json file,
  • model_number : the model to build as different types of models address the problem of classifying characters,
  • epochs : the number of passes over the training set,
  • batch_size : the number of samples in each training batch,
  • expand : if we want to increase the number of samples,
  • reload : if we want to load the best weights of a previous run.
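
One way to expose these parameters on the command line is argparse from the standard library. The sketch below is an assumption about how such a script could be organized: the flag spellings mirror the list above, and the default values are invented.

```python
import argparse


def build_parser():
    """Command-line options mirroring the parameter list above.

    Flag spellings and default values are assumptions for illustration.
    """
    parser = argparse.ArgumentParser(description="Train a character classifier")
    parser.add_argument("--data_dir", default="data", help="folder holding the images")
    parser.add_argument("--logs", default="logs", help="folder for TensorBoard output")
    parser.add_argument("--models", default="models", help="folder for the saved JSON model")
    parser.add_argument("--model_number", type=int, default=2, help="which model to build")
    parser.add_argument("--epochs", type=int, default=30, help="passes over the training set")
    parser.add_argument("--batch_size", type=int, default=128, help="samples per batch")
    parser.add_argument("--expand", action="store_true", help="augment the training images")
    parser.add_argument("--reload", action="store_true", help="restart from the best weights")
    return parser


args = build_parser().parse_args(["--epochs", "10", "--expand"])
print(args.epochs, args.expand, args.reload)  # 10 True False
```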

We start by loading the packages and functions used.

Line 27, we set the variable TF_CPP_MIN_LOG_LEVEL before loading the Keras packages to avoid unwanted logging from the TensorFlow back end.
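
A minimal sketch of this step is shown below. The value "2" is an assumption, since the level we use is not stated here. What matters is that the variable is set before the first TensorFlow or Keras import.

```python
import os

# "0" shows everything, "1" hides INFO, "2" also hides WARNING,
# "3" also hides ERROR messages from the TensorFlow C++ back end.
os.environ["TF_CPP_MIN_LOG_LEVEL"] = "2"

# The Keras and TensorFlow imports would follow here, after the variable is set.
print(os.environ["TF_CPP_MIN_LOG_LEVEL"])  # 2
```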

set_image_dim_ordering specifies which dimension ordering convention Keras will follow.

As stated in the Keras documentation:

  • For 2D data (e.g. image), "tf" assumes (rows, cols, channels) while "th" assumes (channels, rows, cols).
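
The difference between the two conventions is easy to see on a NumPy array. The batch of ten blank 48×48 single-channel images is just a placeholder.

```python
import numpy as np

batch_tf = np.zeros((10, 48, 48, 1))             # "tf": (samples, rows, cols, channels)
batch_th = np.transpose(batch_tf, (0, 3, 1, 2))  # "th": (samples, channels, rows, cols)

print(batch_tf.shape, batch_th.shape)  # (10, 48, 48, 1) (10, 1, 48, 48)
```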

Below, we start at the end of the script, where we build our list of arguments and call the function main with it.

Line 180, we insert the model number within the logs folder name.

Now let’s dive into the main procedure.

Lines 47-48, we fix the random seed to be able to reproduce the result between different runs.
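
The effect of fixing the seed can be checked with the standard library and NumPy generators. The seed value 7 and the helper name are arbitrary choices for this sketch. TensorFlow keeps its own generator, which would need to be seeded as well for fully repeatable training.

```python
import random

import numpy as np

SEED = 7  # arbitrary value chosen for the example


def seeded_draw(seed):
    """Reset both generators, then draw a few numbers."""
    random.seed(seed)
    np.random.seed(seed)
    return random.random(), np.random.rand(3).tolist()


assert seeded_draw(SEED) == seeded_draw(SEED)      # same seed, same numbers
print(seeded_draw(SEED) == seeded_draw(SEED + 1))  # False
```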

Lines 53-54, we call our function to load and build the training and validation sets.

Then we normalize the data and one hot encode the labels.
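
Both operations are short in NumPy. The sketch below assumes that normalizing means dividing 8-bit pixel values by 255 and that labels are integer class indices; Keras also ships a to_categorical helper for the encoding step.

```python
import numpy as np


def normalize(images):
    """Scale 8-bit pixel values from [0, 255] to [0.0, 1.0]."""
    return images.astype("float32") / 255.0


def one_hot(labels, num_classes):
    """Turn integer class indices into rows with a single 1."""
    encoded = np.zeros((len(labels), num_classes), dtype="float32")
    encoded[np.arange(len(labels)), labels] = 1.0
    return encoded


pixels = np.array([[0, 128, 255]], dtype=np.uint8)
print(normalize(pixels).max())                   # 1.0
print(one_hot(np.array([0, 61, 10]), 62).shape)  # (3, 62)
```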

Although we have been using Python for just six months, we are amazed by its simplicity and power.

The next step is to build the metadata and the sprite image for Tensorboard and its projector.

Line 73, we limit the size of the sprite image to 10,000 thumbnails, that is, an image of 100 by 100.
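
A sprite image is just the thumbnails tiled row by row into one large square picture. The sketch below builds one with NumPy, using ten random 48×48 thumbnails. The function name is an assumption, and writing the result to an image file is left out.

```python
import numpy as np


def build_sprite(images, max_thumbnails=10_000):
    """Tile square thumbnails, row by row, into one square sprite image."""
    images = images[:max_thumbnails]
    count, height, width = images.shape
    grid = int(np.ceil(np.sqrt(count)))  # 10,000 thumbnails give a 100 x 100 grid
    sprite = np.zeros((grid * height, grid * width), dtype=images.dtype)
    for index, image in enumerate(images):
        row, col = divmod(index, grid)
        sprite[row * height:(row + 1) * height,
               col * width:(col + 1) * width] = image
    return sprite


thumbnails = np.random.default_rng(0).integers(0, 256, size=(10, 48, 48), dtype=np.uint8)
sprite = build_sprite(thumbnails)
print(sprite.shape)  # (192, 192)
```

Ten thumbnails need a 4×4 grid, so the last cells of the sprite stay black.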

Below, we build the model that corresponds to the chosen model number.

The argument of the function cnn_model_nn is the number of classes in our classification, that is 62 ([0-9a-zA-Z]).

The line numbers differ from those shown earlier because the current script already contains many more models.

Lines 111-114, we load the weights and compile the model again.

This callback function is very classic. You can find more details in the Keras documentation.

The next one is more complex because, when we wrote this script, Keras did not provide all the parameters needed by the TensorBoard projector.

The method TensorResponseBoard has been written by Mr Yu-Yang. You can find the code of this module at

Now we just have to call the fit function to train our model, which takes some time.

Lines 138-140, we call the fit function with all our parameters.

Lines 155-158, we save our model in JSON format.

Depending on the number of epochs, the batch size, the size of the training and validation set, you get an accuracy above 99 percent.

So far, we have some models that have been trained and evaluated on a dataset. Next, we will test these models against other test datasets.

We will show this in a future article.


[1] – Théodore Bluche. Deep Neural Networks for Large Vocabulary Handwritten Text Recognition. Computers and Society [cs.CY]. Université Paris Sud – Paris XI, 2015. English. <NNT : 2015PA112062>.

[2] – Ciresan, D. C., Meier, U., Gambardella, L. M., & Schmidhuber, J. (2010). Deep big simple neural nets for handwritten digit recognition. Neural computation, 22 (12), 3207–3220.

[3] – Thomas, S., Chatelain, C., Paquet, T., & Heutte, L. (2013). Un modèle neuro markovien profond pour l’extraction de séquences dans des documents manuscrits. Document numérique, 16 (2), 49–68.

[4] – LeCun, Y., Boser, B., Denker, J. S., Henderson, D., Howard, R. E., Hubbard, W., & Jackel, L. D. (1989). Backpropagation applied to handwritten zip code recognition. Neural computation, 1 (4), 541–551.

[5] – Matthew D. Zeiler, Rob Fergus. Visualizing and Understanding Convolutional Networks, arXiv:1311.2901v3 [cs.CV] 28 Nov 2013.


[1] MNIST database – 

