Convolutional Neural Network Testing

Now, we have some trained models and need to know how well they perform on unseen data.

As Dr. Jason Brownlee wrote [1]:

“Why can’t you prepare your machine learning algorithm on your training dataset and use predictions from this same dataset to evaluate performance? The simple answer is overfitting.”


Red sea near Al Hudaydah and Mocha, Yemen, 2000

Evaluate CNN models

We have trained two CNN models on a training dataset with different parameters: epochs, batch_size and number of samples.

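Model      | Run | Epochs | Batch size | Training set | Validation set | Testing set | Accuracy (test)
-----------|-----|--------|------------|--------------|----------------|-------------|----------------
model n°01 | 01  | 30     | 50         | 84,000       | 21,000         | 10,000      | 96.72%
model n°01 | 02  | 30     | 50         | 67,200       | 16,800         | 10,000      | 97.13%
model n°02 | 01  | 30     | 50         | 84,000       | 21,000         | 10,000      | 98.31%
model n°02 | 02  | 30     | 50         | 84,000       | 7,000          | 10,000      | 99.82%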

We also prepared six testing datasets to evaluate our models by generating characters with different fonts.

Set n°1 – Evaluation dataset 01

Set n°2 – Evaluation dataset 02

Set n°3 – Evaluation dataset 03

Set n°4 – Evaluation dataset 04

Set n°5 – Evaluation dataset 05

Set n°6 – Evaluation dataset 06

This is not really handwriting, but good enough to study the way deep learning algorithms behave on such datasets.

Currently, our set of handwritten characters is too small to train our models, but it is large enough for evaluation. We are working on creating a larger set of handwritten characters from different people.

The two models evaluated

The architectures of these two sequential models are similar, and both fit into our GPU memory.

Model n°1

Lines 5, 7, 9 – This first model has 3 convolutional layers with an increasing number of filters. We use a 5 by 5 kernel on the first one and a 3 by 3 kernel on the next ones.

Line 6 – After the first convolutional layer, we drop out 30 percent of the input to help prevent overfitting.

Lines 8, 10 – After the second and third convolutional layers, we use pooling.

As Dr. Jason Brownlee wrote [1]:

“As such, pooling may be considered a technique to compress or generalize feature representations and generally reduce the overfitting of the training data by the model.”
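To make this compression effect concrete, here is a minimal NumPy sketch of non-overlapping max pooling on a single feature map. The 2 by 2 window is an assumption, since the pool size is not stated here, and the function name max_pool_2x2 is purely illustrative; in the models themselves this job is done by a Keras pooling layer.

```python
import numpy as np


def max_pool_2x2(feature_map):
    """Non-overlapping 2x2 max pooling on a 2-D feature map with even sides.

    The 2x2 window is an assumption made for this illustration.
    """
    height, width = feature_map.shape
    if height % 2 or width % 2:
        raise ValueError("height and width must both be even")
    # Split the map into 2x2 blocks, then keep the maximum of each block.
    blocks = feature_map.reshape(height // 2, 2, width // 2, 2)
    return blocks.max(axis=(1, 3))


if __name__ == "__main__":
    feature_map = np.arange(16).reshape(4, 4)
    # A 4x4 map is reduced to 2x2: a quarter of the values remain.
    print(max_pool_2x2(feature_map))
```

Each output value summarizes a small neighbourhood, so the next layer sees a smaller and slightly more position-tolerant representation.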

Lines 12, 14 – The next layers are two dense ones with 1024 neurons each and dropout between them. The activation function used here is the hyperbolic tangent.

Lines 13, 15 – We drop out 30% and 20% of the inputs, respectively, to prevent overfitting.

Line 16 – The model ends with our classification layer, which has num_classes=62 outputs and uses the softmax activation function.

Line 18 – We use the Adam optimizer, introduced by Diederik Kingma and Jimmy Ba in 2014 [2].

“Adam, an algorithm for first-order gradient-based optimization of stochastic objective functions, based on adaptive estimates of lower-order moments.”
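As an illustration of what "adaptive estimates of lower-order moments" means, the sketch below writes out one Adam update in NumPy, following the update rule and default hyperparameters given in [2]. The function names adam_step and minimize_quadratic are ours, and the quadratic objective is only a toy example; the models in this article rely on the Keras implementation of Adam, not on this code.

```python
import numpy as np


def adam_step(param, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """Apply one Adam update and return the new (param, m, v).

    m and v are the running estimates of the first and second moments of
    the gradient; t is the 1-based step counter used for bias correction.
    """
    m = beta1 * m + (1.0 - beta1) * grad
    v = beta2 * v + (1.0 - beta2) * grad ** 2
    m_hat = m / (1.0 - beta1 ** t)  # bias-corrected first moment
    v_hat = v / (1.0 - beta2 ** t)  # bias-corrected second moment
    param = param - lr * m_hat / (np.sqrt(v_hat) + eps)
    return param, m, v


def minimize_quadratic(start=5.0, steps=2000, lr=0.01):
    """Toy usage: minimize f(x) = x**2, whose gradient is 2 * x."""
    x, m, v = start, 0.0, 0.0
    for t in range(1, steps + 1):
        x, m, v = adam_step(x, 2.0 * x, m, v, t, lr=lr)
    return float(x)
```

On the very first step the bias correction makes the update size close to the learning rate, whatever the scale of the gradient, which is one reason Adam needs little tuning.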

Below is the graph as shown in TensorBoard.


Figure 7 – CNN model n°01

We got the following results on the first run, which lasted 5 hours and 45 minutes and pushed our laptop to its limits.

Epoch 24/30 - 586s - loss: 0.1424 - acc: 0.9616 - val_loss: 0.1351 - val_acc: 0.9672

Evaluating model... CNN Model 001 Error: 3.28% Accuracy: 96.72%

The second run gave the following result:

Epoch 25/30 - 475s - loss: 0.1236 - acc: 0.9660 - val_loss: 0.1297 - val_acc: 0.9713

Evaluating model... CNN Model 001 Error: 2.875000% Accuracy: 97.125%

Although the accuracy is not very high, it is a fairly good baseline for studying how it varies with different parameters and for testing the model on several testing sets.

The two runs differ in the size of the training and validation datasets, and also in the way we fed the model. In the second run, the datasets were expanded and then shuffled, whereas in the first run they were only expanded by shift and rotation. Mixing the sets up gives a noticeable improvement in accuracy (from 96.72% to 97.13%).
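A minimal sketch of the "expand, then shuffle" step could look as follows. It assumes the expanded images and labels are already held in two NumPy arrays of the same length; the function names, the seed value and the 20% validation fraction are assumptions, although an 80/20 split is consistent with the 84,000/21,000 and 67,200/16,800 sizes listed in the table above.

```python
import numpy as np


def shuffle_in_unison(images, labels, seed=7):
    """Shuffle images and labels with the same permutation.

    The seed value is an arbitrary choice for this illustration.
    """
    if len(images) != len(labels):
        raise ValueError("images and labels must have the same length")
    order = np.random.default_rng(seed).permutation(len(images))
    return images[order], labels[order]


def split_train_validation(images, labels, validation_fraction=0.2):
    """Split already-shuffled arrays into (train, validation) pairs."""
    n_val = int(round(len(images) * validation_fraction))
    n_train = len(images) - n_val
    train = (images[:n_train], labels[:n_train])
    validation = (images[n_train:], labels[n_train:])
    return train, validation
```

Shuffling before splitting matters here because the shifted and rotated copies of a character are generated next to each other; without it, the validation set would be drawn from one contiguous slice of the expanded data.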

Model n°2

Lines 5, 7 – This second model has 2 convolutional layers with respectively 32 and 64 filters. We use a kernel of 5 by 5 on the first one and a kernel of 3 by 3 for the next one.

Lines 6, 8 – After both Conv2D layers, we use pooling.

Line 10 – The next layer is a dense one with 1024 neurons. The activation function used here is ReLU.

Line 11 – We then drop out 50 percent of the input to prevent overfitting.

Line 12 – The model ends with the same layer as before.

Line 14 – The model is compiled with the same parameters.
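The description above can be sketched in Keras as follows. This is a reconstruction under assumptions, not the original listing, so its line numbers do not match the ones quoted above. The filter counts, kernel sizes, dense layer, dropout rate, softmax output and Adam optimizer come from the text; the 28 by 28 grayscale input, the channels-last layout, the ReLU activation on the convolutional layers, the 2 by 2 pool size, the Flatten layer and the categorical cross-entropy loss are assumptions.

```python
from tensorflow import keras
from tensorflow.keras import layers


def build_model_2(input_shape=(28, 28, 1), num_classes=62):
    """Sketch of model n°2; input_shape and several details are assumed."""
    model = keras.Sequential([
        keras.Input(shape=input_shape),
        # Two convolutional layers (32 then 64 filters), each followed by pooling.
        layers.Conv2D(32, (5, 5), activation="relu"),
        layers.MaxPooling2D(pool_size=(2, 2)),
        layers.Conv2D(64, (3, 3), activation="relu"),
        layers.MaxPooling2D(pool_size=(2, 2)),
        layers.Flatten(),
        # One dense layer of 1024 neurons, then 50% dropout.
        layers.Dense(1024, activation="relu"),
        layers.Dropout(0.5),
        # Classification layer: one output per character class.
        layers.Dense(num_classes, activation="softmax"),
    ])
    model.compile(loss="categorical_crossentropy",
                  optimizer="adam",
                  metrics=["accuracy"])
    return model
```

Calling build_model_2().summary() prints the layer stack and parameter counts, which is a quick way to check that the model fits in memory before training.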

Below is the graph as shown in TensorBoard.


Figure 8 – CNN model n°02

We ran two trainings and got the following results:

Epoch 25/30 - 317s - loss: 0.0638 - acc: 0.9835 - val_loss: 0.0760 - val_acc: 0.9831

Evaluating model... CNN Model 002 Error: 1.69% Accuracy: 98.31%

The second run gave the following result:

Epoch 27/30 - 312s - loss: 0.0605 - acc: 0.9844 - val_loss: 0.0047 - val_acc: 0.9983

Evaluating model... CNN Model 002 Error: 0.171429% Accuracy: 99.82%

We get better results with this model during the training step. Shuffling the training and validation sets after expanding them improves the accuracy. Nevertheless, the question remains how the models will behave on unseen data.

Loading data

The next module contains the functions that load the images and labels used to evaluate our models.

The packages used in this module.

The function load_data_for_evaluation builds the lists of images and labels.

This function builds the arrays of grayscale images, together with the associated labels and filenames.
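Since the original listing is not reproduced here, the following is only a sketch of what such a loader might look like. Two things are assumptions: the label is taken from the part of the file name before the first underscore (for example A_0001.png), and the image reader is passed in as the hypothetical parameter read_gray_image, because the imaging library used is not named in this article.

```python
import os

import numpy as np


def load_data_for_evaluation(data_dir, read_gray_image):
    """Return (images, labels, filenames) for the image files in data_dir.

    read_gray_image is a caller-supplied function mapping a file path to a
    2-D grayscale array. The "<label>_<anything>.<ext>" naming scheme is an
    assumption made for this sketch.
    """
    images, labels, filenames = [], [], []
    for name in sorted(os.listdir(data_dir)):
        if not name.lower().endswith((".png", ".jpg", ".jpeg")):
            continue  # ignore anything that is not an image file
        images.append(read_gray_image(os.path.join(data_dir, name)))
        labels.append(name.split("_", 1)[0])
        filenames.append(name)
    return np.array(images), np.array(labels), filenames
```

Keeping the filenames alongside the images makes it possible to trace a wrong prediction back to the exact file later on.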

The main module

The main module is called .

The parameters are:

  • data_dir  : the folder where the images are stored,
  • logs  : the folder where we will write the data produced by the script (sprite image, metadata, checkpoints, events log, …),
  • model_number  : the model to build, since several types of models address the problem of classifying characters,
  • weights  : the best weights of a previous run to reload.
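These four parameters could be declared with argparse along the following lines. The parameter names come from the list above; the double-dash flag spelling, the types, the defaults and the help strings are assumptions.

```python
import argparse


def build_parser():
    """Command-line parser for the evaluation script (flags are assumed)."""
    parser = argparse.ArgumentParser(
        description="Evaluate a trained CNN on a folder of character images.")
    parser.add_argument("--data_dir", required=True,
                        help="folder where the images are stored")
    parser.add_argument("--logs", default="logs",
                        help="folder for the data produced by the script")
    parser.add_argument("--model_number", type=int, default=1,
                        help="which model to build")
    parser.add_argument("--weights", default=None,
                        help="best weights of a previous run to reload")
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args()
    print(args)
```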

We start by loading the packages and functions used. The lines before 34 are comments.

Line 39 – We set the environment variable to avoid unnecessary warnings from TensorFlow.
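The variable is not named here; a common choice for this purpose, assumed in the snippet below, is TF_CPP_MIN_LOG_LEVEL, which controls how much TensorFlow's native code writes to the console.

```python
import os

# Assumed variable and value: "2" hides INFO and WARNING messages and keeps
# errors. It must be set before TensorFlow is imported to take effect.
os.environ["TF_CPP_MIN_LOG_LEVEL"] = "2"
```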

Lines 45-46 – We load our local functions/modules.

Then we set the image dimension ordering for Keras to Theano’s convention, channels first.

The main procedure starts by setting the random seed for reproducibility.
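A minimal way to do this, assuming NumPy's global generator is the one in use and picking an arbitrary seed value of 7, is sketched below. The helper name set_random_seed is ours.

```python
import random

import numpy as np


def set_random_seed(seed=7):
    """Seed Python's and NumPy's global generators (seed value is arbitrary)."""
    random.seed(seed)
    np.random.seed(seed)
```

Note that this alone does not make GPU training fully deterministic; it only fixes the random draws made through these two generators.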

It then calls the function seen above to load the images and labels.

How the arrays are reshaped depends on the image_dim_ordering variable seen just before.
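The reshaping step amounts to adding a channel axis in the position the backend expects. The sketch below assumes the images arrive as an array of shape (samples, height, width); the function name add_channel_axis and its boolean parameter are illustrative.

```python
import numpy as np


def add_channel_axis(images, channels_first=True):
    """Reshape (samples, height, width) grayscale images for a Conv2D input.

    channels_first=True gives (samples, 1, height, width), the Theano-style
    ordering; otherwise the result is (samples, height, width, 1).
    """
    samples, height, width = images.shape
    if channels_first:
        return images.reshape(samples, 1, height, width)
    return images.reshape(samples, height, width, 1)
```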

Line 73 – We normalize our data as we did for training.

Line 76 – We standardize the labels/classes.

Line 77 – The number of classes is derived from the testing dataset, so this dataset must contain all categories, that is, all types of characters [0-9a-zA-Z].
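The normalization and label encoding steps can be sketched as follows. Dividing by 255 assumes 8-bit grayscale pixels, and the class order (digits, then lowercase, then uppercase) is an assumption that simply follows the [0-9a-zA-Z] notation above; both must match whatever was used at training time.

```python
import string

import numpy as np

# 62 classes in an assumed fixed order: 0-9, a-z, A-Z.
CLASSES = string.digits + string.ascii_lowercase + string.ascii_uppercase


def normalize(images):
    """Scale 8-bit pixel values to the range [0, 1]."""
    return images.astype("float32") / 255.0


def encode_labels(labels):
    """One-hot encode character labels against the fixed CLASSES alphabet."""
    index = {char: i for i, char in enumerate(CLASSES)}
    ids = np.array([index[label] for label in labels])
    one_hot = np.zeros((len(ids), len(CLASSES)), dtype="float32")
    one_hot[np.arange(len(ids)), ids] = 1.0
    return one_hot
```

Encoding against a fixed alphabet, as done in this sketch, always yields 62 columns even when a testing dataset happens to lack some characters.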

Lines 81, 83, 86 – We call the function to build and compile our model.

Lines 86 to 99 have been removed for simplicity, as they concern other models that we have already built. We will use those models to illustrate a future article about ConvNets and Long Short-Term Memory neural networks.

Line 104 – We try to reload the best weights saved as checkpoints during the training stage.

Line 110 – We compile the model.

Lines 119-120 – We call the model's evaluate function, then display the scores.
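When a Keras model is compiled with accuracy as its only metric, evaluate returns the loss followed by the accuracy. Turning that pair into the "Error / Accuracy" line seen in the logs above could be done as in this sketch; the function name format_scores and the two-decimal formatting are assumptions.

```python
def format_scores(model_name, scores):
    """Format [loss, accuracy] from model.evaluate as an error/accuracy line."""
    _loss, accuracy = scores
    accuracy_pct = accuracy * 100
    return "%s Error: %.2f%% Accuracy: %.2f%%" % (
        model_name, 100 - accuracy_pct, accuracy_pct)


if __name__ == "__main__":
    # Values taken from the first run of model n°1 shown earlier.
    print(format_scores("CNN Model 001", [0.1351, 0.9672]))
```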

The end of this script is just the usual __main__ and the construction of the arguments list.

Last but not least, the evaluations.

We ran this script on the testing datasets mentioned at the beginning of this article, and obtained the results shown below. The last testing dataset contains only handwritten characters.


Figure 9 – Evaluation datasets name and size


Figure 10 – CNN Evaluation results

What can we learn from this experiment?

We see that both models get quite close results on each test set, with a slight advantage for the second model. They perform well on three datasets and poorly on four.

An experiment, not shown here, leads us to think that some test datasets can be considered as a superset of the others. The quality and variety of training datasets are essential for improving learning. This is a bit like humans, who benefit from the diversity of what they read: different vocabularies, different styles, different typographies.

The performance of the two models on the last test dataset does not argue in favor of our approach. In other words, using characters generated from fonts as the training basis for our models does not seem to be relevant. But they are characters, aren't they?

Is this type of neural network able to learn the concept and the meaning of a symbol? At what depth and to what extent can a neural network be considered to have integrated a concept?

Even if these models can be classified as poorly performing, weak models, they are good candidates for our next experiment. We will use them in a voting method that we will show in a future article.

Before that, we will see what they can predict and in which cases they reveal ambiguities.


[1] – Dr. Jason Brownlee. (2017) Machine Learning Mastery with Python.

[2] – Diederik Kingma, Jimmy Ba. (2014) Adam: A Method for Stochastic Optimization, arXiv:1412.6980v8 [cs.LG].

