Now that we have some trained models, we need to know how well they perform on unseen data.
As [1] Dr. Jason Brownlee wrote:
“Why can’t you prepare your machine learning algorithm on your training dataset and use predictions from this same dataset to evaluate performance? The simple answer is overfitting.”
We have trained two CNN models on a training dataset with different parameters: epochs, batch_size and number of samples.
| Model | Run | Epochs | Batch size | Training set | Validation set | Testing set | Accuracy (test) |
|-------|-----|--------|------------|--------------|----------------|-------------|-----------------|
| model n°01 | 01 | 30 | 50 | 84,000 | 21,000 | 10,000 | 96.72% |
| model n°01 | 02 | 30 | 50 | 67,200 | 16,800 | 10,000 | 97.13% |
| model n°02 | 01 | 30 | 50 | 84,000 | 21,000 | 10,000 | 98.31% |
| model n°02 | 02 | 30 | 50 | 84,000 | 7,000 | 10,000 | 99.82% |
We also prepared six testing datasets to evaluate our models by generating characters with different fonts.
This is not really handwriting, but good enough to study the way deep learning algorithms behave on such datasets.
Currently, we do not have enough handwritten characters to train our models, but we do have enough for evaluation. We are, however, working on building a set of handwritten characters collected from different people.
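As an illustration, here is a minimal sketch of how such font-rendered evaluation images could be produced with Pillow. The font path, output folder and naming scheme are examples only, not the exact pipeline behind our datasets; the file names simply follow the "prefix.character.png" pattern expected by the loading function shown later.

import os
import string
from PIL import Image, ImageDraw, ImageFont

def render_chars(fonts, out_dir, size=48):
    """Render the 62 characters [0-9a-zA-Z] from each font into 48x48 grayscale PNGs."""
    os.makedirs(out_dir, exist_ok=True)
    chars = string.digits + string.ascii_letters          # the 62 classes
    for i, font_path in enumerate(fonts):
        font = ImageFont.truetype(font_path, int(size * 0.75))
        for char in chars:
            img = Image.new('L', (size, size), color=0)    # black background
            draw = ImageDraw.Draw(img)
            draw.text((size // 4, size // 8), char, fill=255, font=font)
            # file name like "000001.A.png", so the label can be read from split('.')[-2]
            img.save(os.path.join(out_dir, '%06d.%s.png' % (i, char)))

render_chars(['/usr/share/fonts/truetype/dejavu/DejaVuSans.ttf'],
             'img_chars_eval_example')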
The architectures of these two sequential models are similar, and both fit into our GPU memory.
 1 | # define the first model
 2 | def cnn_model_01(num_classes):
 3 |     # create model 01
 4 |     model = Sequential()
 5 |     model.add(Conv2D(3, (5, 5), input_shape=(1, 48, 48), activation='relu'))
 6 |     model.add(Dropout(0.3))
 7 |     model.add(Conv2D(96, (3, 3), activation='relu'))
 8 |     model.add(MaxPooling2D(pool_size=(2, 2)))
 9 |     model.add(Conv2D(128, (3, 3), activation='relu'))
10 |     model.add(MaxPooling2D(pool_size=(2, 2)))
11 |     model.add(Flatten())
12 |     model.add(Dense(1024, activation='tanh'))
13 |     model.add(Dropout(0.3))
14 |     model.add(Dense(1024, activation='tanh'))
15 |     model.add(Dropout(0.2))
16 |     model.add(Dense(num_classes, activation='softmax', name='classes'))
17 |     # Compile model
18 |     model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
19 |     return model
Lines 5, 7, 9 – This first model has 3 convolutional layers with an increasing number of filters. We use a 5 by 5 kernel on the first one and 3 by 3 kernels on the next ones.
Line 6 – After the first convolutional layer, we drop out 30 percent of the input to help prevent overfitting.
Lines 8, 10 – After the second and third convolutional layers, we use pooling.
As Dr. Jason Brownlee wrote:
“As such, pooling may be considered a technique to compress or generalize feature representations and generally reduce the overfitting of the training data by the model.”
Lines 12, 14 – The next layers are two dense ones with 1024 neurons each, with dropout between them. The activation function used here is the hyperbolic tangent.
Lines 13, 15 – We drop out 30% and 20% of the input, respectively, to prevent overfitting.
Line 16 – The model ends with our classification layer, with num_classes=62 and the softmax activation function.
Line 18 – We use the adam optimizer defined by [2] Diederik Kingma and Jimmy Ba in 2014.
“Adam, an algorithm for first-order gradient-based optimization of stochastic objective functions, based on adaptive estimates of lower-order moments.”
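For reference, Keras also accepts an explicitly configured optimizer object instead of the 'adam' string. The sketch below only restates the default hyperparameters recommended in the paper [2]; it is not a setting we tuned for our runs.

from keras.optimizers import Adam

# Hypothetical variant: pass an optimizer object instead of the 'adam' string.
# These values are simply the defaults suggested in the Adam paper.
adam = Adam(lr=0.001, beta_1=0.9, beta_2=0.999, epsilon=1e-08)

model = cnn_model_01(num_classes=62)     # built and compiled as shown above
model.compile(loss='categorical_crossentropy', optimizer=adam, metrics=['accuracy'])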
Below is the graph of this model as shown in TensorBoard.
We got the following results on the first run, which lasted 5 hours and 45 minutes and pushed our laptop to its limits.
Epoch 24/30 - 586s - loss: 0.1424 - acc: 0.9616 - val_loss: 0.1351 - val_acc: 0.9672
Evaluating model... CNN Model 001 Error: 3.28% Accuracy: 96.72%
The second run gave us the following result:
Epoch 25/30 - 475s - loss: 0.1236 - acc: 0.9660 - val_loss: 0.1297 - val_acc: 0.9713
Evaluating model... CNN Model 001 Error: 2.875000% Accuracy: 97.125%
Although the accuracy is not very high, it is a fairly good baseline to study how it varies with different parameters and to test the model on several testing sets.
The differences between these two runs are the sizes of the training and validation datasets, but also the way we fed the model: in the second run the sets were expanded and then shuffled, while in the first they were only expanded by shift and rotation. Mixing the sets up yields a good improvement in accuracy.
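The expansion code itself is not listed in this article; the sketch below only illustrates the idea of growing a set with small shifts and rotations and then shuffling images and labels with the same permutation. The offsets and angle are arbitrary examples, not the values used in our runs.

import numpy as np
from scipy.ndimage import shift, rotate

def expand_and_shuffle(X, y, seed=7):
    """Expand a set of 48x48 images by small shifts and rotations,
    then shuffle images and labels with the same permutation."""
    X_new, y_new = [], []
    for img, label in zip(X, y):
        X_new.append(img)
        X_new.append(shift(img, (2, 0), cval=0))              # shift 2 px along the rows
        X_new.append(rotate(img, 10, reshape=False, cval=0))  # rotate by 10 degrees
        y_new.extend([label] * 3)
    X_new, y_new = np.asarray(X_new), np.asarray(y_new)
    perm = np.random.RandomState(seed).permutation(len(X_new))
    return X_new[perm], y_new[perm]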
 1 | # define the model
 2 | def cnn_model_02(num_classes):
 3 |     # create model 02
 4 |     model = Sequential()
 5 |     model.add(Conv2D(32, (5, 5), input_shape=(1, 48, 48), activation='relu'))
 6 |     model.add(MaxPooling2D(pool_size=(2, 2)))
 7 |     model.add(Conv2D(64, (3, 3), activation='relu'))
 8 |     model.add(MaxPooling2D(pool_size=(2, 2)))
 9 |     model.add(Flatten())
10 |     model.add(Dense(1024, activation='relu'))
11 |     model.add(Dropout(0.5))
12 |     model.add(Dense(num_classes, activation='softmax', name='classes'))
13 |     # Compile model
14 |     model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
15 |     return model
Lines 5, 7 – This second model has 2 convolutional layers with 32 and 64 filters respectively. We use a 5 by 5 kernel on the first one and a 3 by 3 kernel on the second.
Lines 6, 8 – After both conv2D layers we use pooling.
Line 10 – The next layer is a dense one with 1024 neurons. The activation function used here is ReLU.
Line 11 – We then drop out 50 percent of the input to prevent overfitting.
Line 12 – The model ends with the same classification layer as before.
Line 14 – The model is compiled with the same parameters.
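The training script itself is not shown in this article. As a hedged sketch, a run with the parameters from the table above (30 epochs, batch size 50) and a ModelCheckpoint saving the best weights, which the evaluation script reloads later, could look like this; the weights file name and the X_train/y_train/X_val/y_val arrays are placeholders.

from keras.callbacks import ModelCheckpoint

# Hypothetical training call: 30 epochs, batch size 50, best weights kept on disk.
model = cnn_model_02(num_classes=62)
checkpoint = ModelCheckpoint('logs/weights_best_cnn_02.hdf5',
                             monitor='val_acc', save_best_only=True, verbose=1)
model.fit(X_train, y_train,
          validation_data=(X_val, y_val),
          epochs=30, batch_size=50,
          callbacks=[checkpoint], verbose=2)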
Below is the graph of this model as shown in TensorBoard.
We ran two training runs; the first gave the following results:
Epoch 25/30 - 317s - loss: 0.0638 - acc: 0.9835 - val_loss: 0.0760 - val_acc: 0.9831
Evaluating model... CNN Model 002 Error: 1.69% Accuracy: 98.31%
The second run gave us the following result:
Epoch 27/30 - 312s - loss: 0.0605 - acc: 0.9844 - val_loss: 0.0047 - val_acc: 0.9983
Evaluating model... CNN Model 002 Error: 0.171429% Accuracy: 99.82%
We get better results with this model during the training step. Shuffling the training and validation sets after expanding them improves the accuracy. Nevertheless, the question remains how these models will behave on unseen data.
The next module, called data_mgmt.py, contains the functions to load the images and labels used to evaluate our models.
Here are the packages used in this module:
19 | import numpy as np              # fundamental package for scientific computing
20 | import os                       # operating system dependent functionality
21 | from scipy import ndimage       # various functions for multi-dimensional image processing
22 | from scipy import misc          # scipy miscellaneous routines - various utilities
23 |
24 | from h7as.utils.utils import *  # local functions
The function load_data_for_evaluation builds the image and label lists.
111 | # define load data function
112 | def load_data_for_evaluation(input_folder="../../../h7as_data/img_chars_eval_01"):
113 |     X_img = []
114 |     y_label = []
115 |     X_img_fns = np.array([x for x in os.listdir(input_folder) if x.endswith(".png")])
116 |     # shuffle the images
117 |     np.random.shuffle(X_img_fns)
118 |     for img_fn in X_img_fns:
119 |         img = misc.imread(input_folder + '/' + img_fn, mode='L')
120 |         X_img.append(img)
121 |         # convert char to int
122 |         label = get_alphabet_item(img_fn.split('.')[-2], 'i')
123 |         y_label.append(label)
124 |     X_eval = np.asarray(X_img)
125 |     y_eval = np.asarray(y_label)
126 |     return (X_eval, y_eval), X_img_fns
This function builds the arrays of grayscale images and labels, along with the associated filenames.
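A quick usage example of this function; the folder is simply the default path from the signature above.

# Grayscale images, integer labels and the shuffled file names come back
# ready for reshaping and one-hot encoding.
(X_eval, y_eval), filenames = load_data_for_evaluation("../../../h7as_data/img_chars_eval_01")
print(X_eval.shape)    # (number of samples, 48, 48)
print(y_eval.shape)    # (number of samples,)
print(filenames[:3])   # first three (shuffled) file names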
The main module is called h7as_char_cnn_evaluate.py.
1 | usage: h7as_char_cnn_evaluate.py [-h] -d DATA_DIR -l LOGS -n MODEL_NUMBER
2 |                                  -w WEIGHTS
The parameters are:
- -d DATA_DIR – path to the folder where the data are stored
- -l LOGS – path to the folder where logs and checkpoints are saved
- -n MODEL_NUMBER – the model number (1 or 2)
- -w WEIGHTS – the saved weights file to load
We start by loading the packages and functions used. The lines before line 34 are comments.
34 | # import the necessary packages
35 | import argparse               # makes command line interfaces
36 | import numpy as np            # fundamental package for scientific computing
37 | import time
38 | import os                     # operating system dependent functionality
39 | os.environ['TF_CPP_MIN_LOG_LEVEL'] = '3'
40 | from keras.utils import np_utils
41 | from keras import backend as K
42 | from scipy import misc        # scipy miscellaneous routines - various utilities
43 |
44 | # import modules of the project
45 | from h7as.models.keras_cnn_models import cnn_model_01, cnn_model_02
46 | from h7as.data import data_mgmt
Line 39 – We set the environment variable to avoid unnecessary warnings from TensorFlow.
Lines 45-46 – We load our local modules.
Then we set the image ordering for Keras to Theano's ordering, channels first.
49 | # tell keras to use theano ordering
50 | K.set_image_dim_ordering('th')
51 |
52 | # initialize FLAGS (arguments)
53 | FLAGS = None
The main procedure starts by setting the random seed for reproducibility.
55 | # main
56 | def main(FLAGS):
57 |     # fix random seed for reproducibility
58 |     seed = 7
59 |     np.random.seed(seed)
It then calls the function seen above to load the images and labels.
61 |     print("Loading data..")
62 |     loading_start_time = time.time()
63 |     # load data
64 |     (X_eval, y_eval), _ = data_mgmt.load_data_for_evaluation(FLAGS.data_dir)
65 |     print('Loading duration (s) : %.4f' % (time.time() - loading_start_time))
66 |
67 |     # reshape to be [samples][channels][width][height]
68 |     X_eval = X_eval.reshape(X_eval.shape[0], 1, data_mgmt.IMAGE_SIZE,
69 |                             data_mgmt.IMAGE_SIZE).astype('float32')
Reshaping the arrays depends on the image_dim_ordering setting seen just before: with Theano ordering, the channel axis comes right after the sample axis.
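To make this dependency explicit, the two orderings lead to different target shapes; 48 is simply the value of data_mgmt.IMAGE_SIZE in this project.

# Theano ordering ('th', channels first): the channel axis follows the sample axis.
X_eval = X_eval.reshape(X_eval.shape[0], 1, 48, 48).astype('float32')

# TensorFlow ordering ('tf', channels last) would instead be:
# X_eval = X_eval.reshape(X_eval.shape[0], 48, 48, 1).astype('float32')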
72 |     # normalize inputs from 0-255 to 0-1
73 |     X_eval = X_eval / 255
74 |
75 |     # one hot encode outputs
76 |     y_eval = np_utils.to_categorical(y_eval)
77 |     num_classes = y_eval.shape[1]
Line 73 – We normalize our data as we did for training.
Line 76 – We one-hot encode the labels/classes.
Line 77 – The number of classes is derived from the testing dataset, so it must contain all categories, that is all types of characters [0-9a-zA-Z].
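A small worked example of this encoding: with labels 0, 1 and 3, to_categorical produces one column per class up to the largest label, which is why the evaluation set must contain every character to end up with 62 columns.

from keras.utils import np_utils
import numpy as np

y = np.array([0, 1, 3])
y_onehot = np_utils.to_categorical(y)
print(y_onehot)
# [[ 1.  0.  0.  0.]
#  [ 0.  1.  0.  0.]
#  [ 0.  0.  0.  1.]]
print(y_onehot.shape[1])   # 4 classes here; 62 when all characters are present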
79 |     # build the model
80 |     if FLAGS.model_number == 1:
81 |         model = cnn_model_01(num_classes)
82 |     elif FLAGS.model_number == 2:
83 |         model = cnn_model_02(num_classes)
84 |     else:
85 |         print("Model number not found, building the first one...")
86 |         model = cnn_model_01(num_classes)
Lines 81, 83, 86 – We call the function to build and compile our model.
The lines between 86 and 99 have been removed for simplification, as we already have more models. We will use them to illustrate a future article about ConvNets and Long Short-Term Memory neural networks.
100 |     # reload the weights
101 |     print("Loading best weights...")
102 |     filepath = os.path.join(FLAGS.logs, FLAGS.weights)
103 |     try:
104 |         model.load_weights(filepath)
105 |     except:
106 |         print("Error loading weights {}.".format(filepath))
107 |         exit()
108 |
109 |     # Compile model (required to make predictions)
110 |     model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
111 |     print("Model compiled with loaded weights from previous run")
Line 104 – We try to reload the best weights saved at the checkpoints during the training stage.
Line 110 – We compile the model.
114 |     # summarize performance of the model
115 |     print("Evaluating model...")
116 |     print("Number of samples: {:d}".format(X_eval.shape[0]))
117 |     Evaluation_start_time = time.time()
118 |     # Final evaluation of the model
119 |     scores = model.evaluate(X_eval, y_eval, verbose=0, batch_size=100)
120 |     print("CNN model accuracy: %.2f%% - error: %.2f%%" % (scores[1]*100, 100-scores[1]*100))
121 |     print('Evaluation duration (s) : %.4f' % (time.time() - Evaluation_start_time))
Lines 119-120 – We call the model's evaluate function and then display the scores.
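Beyond this aggregate score, the same data can be passed to model.predict to look at individual predictions. The following lines are a small assumed sketch, not part of the script above; they simply count the misclassified samples.

import numpy as np

# Per-sample predictions: compare the most probable class with the true label.
probas = model.predict(X_eval, batch_size=100, verbose=0)
predicted = np.argmax(probas, axis=1)
expected = np.argmax(y_eval, axis=1)
errors = np.where(predicted != expected)[0]
print("Misclassified samples: {:d} / {:d}".format(len(errors), len(expected)))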
125 | if __name__ == '__main__':
126 |     # construct the argument parser and parse the arguments
127 |     ap = argparse.ArgumentParser()
128 |     ap.add_argument("-d", "--data_dir", required=True,
129 |                     help="path to the folder where data are stored")
130 |     ap.add_argument("-l", "--logs", required=True,
131 |                     help="path to the folder to save logs and checkpoints")
132 |     ap.add_argument("-n", "--model_number", required=True, type=int,
133 |                     help="model number")
134 |     ap.add_argument("-w", "--weights", required=True,
135 |                     help="Saved weights to load")
136 |     FLAGS, unparsed = ap.parse_known_args()
137 |     main(FLAGS)
The end of this script is just the usual __main__ block and the construction of the argument list.
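An example invocation could look like this; the data path and the weight file name are placeholders.

python h7as_char_cnn_evaluate.py -d ../../../h7as_data/img_chars_eval_01 \
                                 -l ./logs -n 2 -w weights_best_cnn_02.hdf5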
We ran this script on the testing datasets mentioned at the beginning of this article, and obtained the results shown below. The last testing dataset contains only handwritten characters.
What can we learn from this experiment?
We see that both models get quite close results on each test set, with a slight advantage for the second model. They perform well on three datasets and poorly on four.
An experiment, not shown here, leads us to think that some test datasets can be considered a superset of the others. The quality and variety of training datasets are essential for improving learning. A bit like humans, who benefit from the diversity of what they read: different vocabularies, different styles, different typographies.
The performance of the two models on the last test dataset does not argue in favor of our approach. In other words, using characters generated from fonts as the training basis of our models does not seem relevant. But they are characters, aren't they?
Is this type of neural network able to learn the concept and meaning of a symbol? At what depth and extent can a neural network be considered to have integrated a concept?
Even if these models can be classified as poorly performing, weak models, they are good candidates for our next experiment. We will use them in a voting method that we will show in a future article.
Before that, we will see what they can predict and in which cases they reveal ambiguities.
[1] Dr. Jason Brownlee (2017), Machine Learning Mastery with Python.
[2] Diederik Kingma, Jimmy Ba (2014), Adam: A Method for Stochastic Optimization, arXiv:1412.6980v8 [cs.LG].
[3] MNIST database – http://yann.lecun.com/exdb/mnist/