Make predictions with a neural network.

After building and evaluating neural network models, let’s see how they behave when making predictions. We will first evaluate predictions model by model, and then combine several models and training runs with the bagging and voting methods.

Bagging and voting are popular methods for combining the predictions from different models:

  • When bagging, you build multiple models, typically of the same type, and train them on different subsamples of the training dataset.
  • When voting, you build multiple models, typically of different types, and use simple statistics to combine their predictions. Here, we calculate the mean of the predictions.
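The two ideas can be sketched in a few lines of plain Python. This is only an illustration under assumptions: the helper names `bootstrap_sample` and `mean_prediction` are hypothetical, the dataset is a list of integers standing in for training samples, and the subsamples are drawn with replacement, which is one common way to build them.

```python
import random

def bootstrap_sample(dataset, rng):
    """Bagging: draw len(dataset) items with replacement to get one training subsample."""
    return [rng.choice(dataset) for _ in dataset]

def mean_prediction(predictions):
    """Voting: average the class-probability vectors given by several models for one input."""
    n = len(predictions)
    return [sum(column) / n for column in zip(*predictions)]

rng = random.Random(0)            # fixed seed so the sketch is repeatable
dataset = list(range(10))         # stand-in for the real training samples
subsamples = [bootstrap_sample(dataset, rng) for _ in range(3)]

# Two hypothetical models give probabilities for two classes on the same image.
combined = mean_prediction([[0.5, 0.5], [0.25, 0.75]])
```

Each subsample would be used to train one model of the same type; `combined` holds the averaged probabilities that decide the final prediction.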


Keukenhof garden

Keukenhof garden, Netherlands, 1994

Combine models

By now we have built five models: two convolutional neural network (CNN) models and three models that combine a CNN with Long Short-Term Memory (LSTM) layers.

These networks are neither wide nor deep. However, they give reasonable results and a sound basis for our experiments.

Below is the summary of two of them.

The number of parameters for this model is quite large. Is it large enough?

Because the CNN + LSTM model contains two stacked sequential models, the summary does not show the structure of the first one.

The models were trained with the following Nvidia GPU and driver.

Nvidia Gpu

Figure 1 – Nvidia GPU

For this experiment we used the models and runs shown below. The Python scripts we wrote make it possible to predict with a single model, or to choose bagging, voting, or both methods.

Models used

Figure 2 – Models used for the experiment.

The models were trained on the three datasets shown below.

Training dataset img_chars

Set n°1

Dataset img_chars_train_special

Set n°2


Set n°3

The shapes of these characters are quite different, which makes them a good basis for studying how ensemble methods can be used with weak models.

We also prepared six testing datasets to evaluate our models: five generated from characters in different fonts, and one created from handwriting (the same one used to train two of the models).

Set n°1

Set n°2

Set n°3

Set n°4

Set n°5


Set n°6

With the basis of the experiment established, let’s make predictions.

Making predictions

This short video shows the execution of the Python script that implements the voting method. The script loops over each model and each image, keeps the five highest probabilities, and stores them in a dictionary. At the end, it averages the probabilities, keeps the five largest for each image, and computes the accuracy over the whole sample.
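The loop described above could look roughly like the following sketch. It is not the original script: the function names are hypothetical, each model's output is assumed to be a dictionary mapping a class label to a probability, and a class missing from a model's top five is assumed to count as zero in the average.

```python
from collections import defaultdict

def top_k(probs, k=5):
    """Return the k (label, probability) pairs with the highest probability."""
    return sorted(probs.items(), key=lambda item: item[1], reverse=True)[:k]

def vote(per_model_probs, k=5):
    """per_model_probs[m][i] is the {label: probability} dict of model m for image i."""
    n_models = len(per_model_probs)
    n_images = len(per_model_probs[0])
    results = []
    for i in range(n_images):
        totals = defaultdict(float)
        for model_probs in per_model_probs:
            # Keep only the five best candidates of this model for this image.
            for label, p in top_k(model_probs[i], k):
                totals[label] += p
        # Average over all models (assumption: an absent label contributes 0).
        averaged = {label: total / n_models for label, total in totals.items()}
        results.append(top_k(averaged, k))
    return results

def top_k_accuracy(results, labels, k=1):
    """Share of images whose true label is among the first k ranked candidates."""
    hits = sum(
        1 for ranked, truth in zip(results, labels)
        if truth in [label for label, _ in ranked[:k]]
    )
    return hits / len(labels)

# Two hypothetical models, two images.
model_a = [{"a": 0.5, "b": 0.25, "c": 0.25}, {"x": 0.75, "y": 0.25}]
model_b = [{"a": 0.25, "b": 0.75}, {"x": 0.5, "y": 0.5}]
ranked = vote([model_a, model_b])
```

Calling `top_k_accuracy(ranked, labels, k=1)` gives the share of images ranked correctly in first place, and `k=5` the share where the true label appears anywhere among the five retained candidates.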

With a single model

We could choose any model and compare its performance across several evaluation datasets. We select the third model in the list [model02 – CNN – run_02_99.83] and look at its performance on the training dataset and on three other datasets (two generated from fonts, one from handwriting).

Here are the results for 150 samples.

accuracy dataset img_chars

Figure 3 – Predictions accuracy dataset img_chars

accuracy dataset eval_01

Figure 4 – Predictions accuracy dataset eval_01

accuracy dataset eval_05

Figure 5 – Predictions accuracy dataset eval_05

accuracy dataset img_hw_collection

Figure 6 – Predictions accuracy dataset img_hw_coll.

The results are in line with our expectations. The model performs well on the training dataset, as one would expect, and is much less accurate on the unseen datasets.

This gives us a baseline for comparison with the following methods. Let’s now try an ensemble method.

With the Bagging method

Here we use a different Python script that lets us set the models and runs used to make predictions.

We choose the same model and its three runs [model02 – CNN – run_01_98.31 / run_02_99.83 / run_03_100.00]

Bagging accuracy dataset eval_01

Figure 7 – Bagging accuracy dataset eval_01

Bagging accuracy dataset eval_05

Figure 8 – Bagging accuracy dataset eval_05

Bagging accuracy dataset hw_coll

Figure 9 – Bagging accuracy dataset hw_coll

There are improvements, but not for every testing dataset: precision increases by 21, 19 and 0 points respectively, so the handwriting set does not benefit.

The overall score on the hw_collection dataset is even worse. The models have very variable precision on this dataset, so the runs should be chosen carefully, for example by excluding those with a precision below 99.5. The training could also be carried out on a larger dataset, 55,000 samples instead of the current 35,000, with the samples enriched to reach a set of more than 130,000 specimens.
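Excluding weak runs can be automated with a small filter. The sketch below assumes that the trailing number in a run name such as run_02_99.83 is the score of that run; the helper names are hypothetical and the threshold of 99.5 is the one suggested above.

```python
def run_score(run_name):
    """Parse the trailing number of a name such as 'run_02_99.83' (assumed to be the score)."""
    return float(run_name.rsplit("_", 1)[1])

def select_runs(run_names, threshold=99.5):
    """Keep only the runs whose score reaches the threshold."""
    return [name for name in run_names if run_score(name) >= threshold]

runs = ["run_01_98.31", "run_02_99.83", "run_03_100.00"]
selected = select_runs(runs)   # run_01_98.31 is dropped
```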

Now, let’s have a look at the voting method.

With the Voting method

We use the same script with different models and runs. We choose the following ones:

  • model01 – CNN – run_02_97.23
  • model02 – CNN – run_03_100.00
  • model05 – CNN+LSTM – run_01_99.93

Voting accuracy dataset eval_01

Figure 10 – Voting accuracy dataset eval_01

Voting accuracy dataset eval_05

Figure 11 – Voting accuracy dataset eval_05

 Voting accuracy dataset hw_coll

Figure 12 – Voting accuracy dataset hw_coll

We get the same results on eval_01 and eval_05, with an identical distribution. These results suggest that our choice of models and runs was not very judicious. One can reasonably think that model01 drags the whole ensemble down.

As for the last data set, the precision returns to the level obtained in the first test with the single model model02-CNN-run_02_99.83, though the distribution is different.

With a combined method

For this experiment we used the following models and runs:

  • model01 – CNN – run_02_97.23
  • model02 – CNN – run_02_99.83 / run_03_100.00
  • model03 – CNN+LSTM – run_01_94.60
  • model04 – CNN+LSTM – run_01_96.65
  • model05 – CNN+LSTM – run_02_99.92 / run_01_99.93
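One plausible way to combine the two methods, given a selection like this one, is to average first over the runs of each model and then over the models. The sketch below makes that assumption explicit; it is not taken from the original script, and the names and probability vectors are made up for illustration.

```python
def mean_vectors(vectors):
    """Element-wise mean of several probability vectors of equal length."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

def bag_then_vote(predictions_by_model):
    """predictions_by_model maps a model name to one probability vector per run.

    Step 1 (bagging): average the runs of each model.
    Step 2 (voting): average the per-model results.
    """
    per_model = [mean_vectors(runs) for runs in predictions_by_model.values()]
    return mean_vectors(per_model)

# Hypothetical outputs for one image and two classes.
predictions = {
    "model01": [[0.5, 0.5]],                  # a single run
    "model02": [[1.0, 0.0], [0.5, 0.5]],      # two runs
}
combined = bag_then_vote(predictions)
```

With this two-step average, a model with many runs does not outweigh a model with a single run, which a flat average over all runs would allow.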

dataset eval_01

Figure 13 – Bagging & Voting dataset eval_01

dataset eval_05

Figure 14 – Bagging & Voting dataset eval_05

dataset hw_coll

Figure 15 – Bagging & Voting dataset hw_coll

The overall accuracy is now 100 percent on all datasets. Nevertheless, these results are obtained on only 150 samples. It is interesting to take a closer look at the distribution of the results. For the first two data sets, the first-rank precision drops by 6 and 17 points, whereas for the last one it increases by about 32 points. Models three and four have a lot to do with this. It must be said that the experiment is somewhat biased.

Still, this gives us a good idea of how ensemble methods can improve performance.

So let’s try again, this time with 1,500 samples.
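A rough calculation shows what the larger sample buys. If a classifier makes no error on n independent samples, the true error rate p must satisfy (1 - p)^n ≥ 0.05 at a 95 percent confidence level, which gives p ≤ 1 - 0.05^(1/n). The sketch below applies this bound; the function name is hypothetical, and the independence of the samples is an assumption.

```python
def max_error_rate(n_samples, confidence=0.95):
    """Largest error rate still compatible with zero errors on n_samples independent samples."""
    return 1.0 - (1.0 - confidence) ** (1.0 / n_samples)

for n in (150, 1500):
    print(f"{n} samples: error rate below {100 * max_error_rate(n):.2f} %")
```

With 150 samples a perfect score only rules out error rates above roughly 2 percent; with 1,500 samples the bound tightens to roughly 0.2 percent.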

dataset eval_01

Figure 16 – Bagging & Voting dataset eval_01

dataset eval_05

Figure 17 – Bagging & Voting dataset eval_05

dataset hw_coll

Figure 18 – Bagging & Voting dataset hw_coll

The distribution is wider for most of those predictions, yet we still get an overall accuracy of 100 percent. This is great.

Applying the method to the six testing datasets gives the following result.

graph six datasets

Figure 19 – Bagging & Voting on six datasets

The subject we are working on lends itself well to this kind of result. Indeed, by relying on lexicons of words, we could certainly transcribe some writings. However, we must be careful, because the handwriting of the people we asked to build our collection is legible and regular. This is not really the case for all the records in the civil registers, especially after the French Revolution, between 1800 and 1812.


[1] – Dr. Jason Brownlee (2017), Machine Learning Mastery with Python.


[1] MNIST database – 

