IT Engineering
When you enter a new area of knowledge, the journey can read like a fantastic story. With Artificial Intelligence, we have the impression of entering one of the tents described by J.K. Rowling in Harry Potter and the Goblet of Fire: from the outside, the domain does not reveal its full extent. You are then caught by the subject, and it is a real pleasure to discover the rooms of the building one after the other.
In artificial intelligence, we distinguish between weak and strong artificial intelligence. Wikipedia gives a definition of each of these areas:
Weak artificial intelligence (weak AI), also known as narrow AI, is non-sentient artificial intelligence that is focused on one narrow task. Weak AI is defined in contrast to either strong AI (a machine with consciousness, sentience and mind) or artificial general intelligence (a machine with the ability to apply intelligence to any problem, rather than just one specific problem). All currently existing systems considered artificial intelligence of any sort are weak AI at most.
Our study falls into the field of weak artificial intelligence, and we will use different algorithms to build a solution with good reliability and accuracy using limited means.
As Dr. Théodore Bluche tells us in his thesis [1]:
In handwriting recognition, however, deep neural networks are limited to convolutional architectures, such as convolutional neural networks or MDLSTM-RNNs, with only a few weights and extracted features in bottom layers. Densely connected neural networks with more than one or two hidden layers are only found in smaller applications, such as the recognition of isolated characters ([2] Ciresan et al., 2010; Ciresan et al., 2012) or keyword spotting ([3] Thomas et al., 2013).
Here our goal is, initially, to recognize isolated characters with a neural network and to deepen our knowledge of the field.
Dr. Jason Brownlee writes in his article about deep learning, “Crash Course in Convolutional Neural Networks for Machine Learning”:
Convolutional Neural Networks are a powerful artificial neural network technique. These networks preserve the spatial structure of the problem and were developed for object recognition tasks such as handwritten digit recognition. They are popular because people are achieving state-of-the-art results on difficult computer vision and natural language processing tasks.
Convolutional neural networks, also called convnets or CNNs, were introduced by Yann LeCun et al. [4] in 1989 for handwritten digit recognition, which is a classification problem. They built the MNIST database (Modified National Institute of Standards and Technology database), a large database of handwritten digits that is commonly used for training various image processing systems.
The MNIST database of handwritten digits [6] has a training set of 60,000 examples and a test set of 10,000 examples. It is a subset of a larger set available from NIST. The digits have been size-normalized and centered in a fixed-size image.
The description of a convolutional neural network is given by Matthew D. Zeiler and Rob Fergus [5] as follows:
The model maps a color 2D input image xi, via a series of layers, to a probability vector yi over the C different classes.
Each layer consists of
- (i) convolution of the previous layer output (or, in the case of the 1st layer, the input image) with a set of learned filters;
- (ii) passing the responses through a rectified linear function (relu(x) = max(x; 0));
- (iii) [optionally] max pooling over local neighborhoods; and
- (iv) [optionally] a local contrast operation that normalizes the responses across feature maps.
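To make steps (ii) and (iii) concrete, here is a minimal NumPy sketch of our own (not from the paper) of the rectified linear function and a 2×2 max pooling on a toy feature map:

import numpy as np

def relu(x):
    # rectified linear function: relu(x) = max(x, 0), applied element-wise
    return np.maximum(x, 0)

def max_pool_2x2(fmap):
    # max over non-overlapping 2x2 neighborhoods of a 2D feature map (even dimensions assumed)
    h, w = fmap.shape
    return fmap.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

fmap = np.array([[-1., 2., 0., 3.],
                 [ 4., -5., 1., 0.],
                 [ 0., 1., -2., 2.],
                 [ 3., 0., 1., -4.]])
print(relu(fmap))                # negative responses clipped to 0
print(max_pool_2x2(relu(fmap)))  # [[4. 3.] [3. 2.]]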
A single GeForce GT 750M GPU has only 2048 MB of memory, which limits the maximum size of the networks that can be trained on it. This also limits the size of the validation and test sets with Tensorflow. The 8 GB of memory on our laptop further limits the size of the training set to around 150,000 samples.
Our first convolutional neural network is a simple adaptation of the network used in studies done with the MNIST database. The differences are the size of the input, 2,304 cells instead of 784, and the number of classes, 62 instead of 10.
The images have a resolution of 48×48 pixels with a depth of 8 bits, and we classify them into the digits [0-9] and the English letters [a-zA-Z].
We built the model using Keras with Tensorflow as back-end.
The picture below shows the whole network. This graph has to be read bottom-up.
You will find a description of the different layers and components of a convolutional neural network model in Dr. Jason Brownlee’s article mentioned above.
To decrease the amount of memory needed, we implement a pooling layer that divides, here by 2 in each dimension, the number of cells passed to the next layer.
After two groups of convolutional and pooling layers, which extract the features of the image, we flatten the matrix and pass it through a fully connected layer to compute the probabilities used to classify the character embedded in the image.
We trained this model several times on a training set of 44,788 characters and a validation set of 700 characters for 30 epochs, then tested the best weights on a testing set of 11,198 characters.
The next figures show the curves of the accuracy and loss for both training and validation steps.
We can see on those curves that the learning process is fast, maybe too fast. The validation score drops at the third epoch, and the accuracy reaches 99.46% at the tenth epoch. The validation set is, in our opinion, not big enough. Unfortunately, we cannot increase it for lack of means. Nevertheless, the results give fairly interesting ideas for getting around this.
We can use aggregation and voting mechanisms with so-called weak models.
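For instance, here is a minimal sketch of a majority vote, assuming a list models of several trained Keras models (the variable and function names are ours, for illustration only):

import numpy as np

def majority_vote(models, X):
    # one predicted class per sample and per model -> shape (n_models, n_samples)
    votes = np.stack([np.argmax(m.predict(X), axis=1) for m in models])
    # for each sample, keep the class most frequently predicted across the models
    return np.array([np.bincount(votes[:, i]).argmax()
                     for i in range(votes.shape[1])])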
Tensorboard provides the projector facility to display the distribution of our data, which greatly eases their analysis.
With the t-SNE algorithm, after more than 5,500 iterations, you get the following representation of the distribution of the data after a simulation of the learning process.
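The projector computes this view for you, but a similar picture can be produced offline. Below is a minimal sketch with scikit-learn, assuming a trained model (with its last dense layer named 'classes', as in our code) and a test set X_test:

from keras.models import Model
from sklearn.manifold import TSNE

# use the activations of the layer named 'classes' as embeddings
embedder = Model(inputs=model.input, outputs=model.get_layer('classes').output)
embeddings = embedder.predict(X_test[:1000])

# project the embeddings to 2D; n_iter matches the ~5,500 iterations mentioned above
coords = TSNE(n_components=2, n_iter=5500).fit_transform(embeddings)
print(coords.shape)  # (1000, 2)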
Now that we have the frame, let’s move on to its implementation in Python 3.6 with Keras 2.0.6 and Tensorflow 1.4.
We are going to define a first module called keras_cnn_models.py. It contains the different models that we tested; here only the second one is shown.
 1 | from keras.models import Sequential
 2 | from keras.layers import Dense
 3 | from keras.layers import Dropout
 4 | from keras.layers import Flatten
 5 | from keras.layers.convolutional import Conv2D
 6 | from keras.layers.convolutional import MaxPooling2D
 7 |
 8 | # define the model
 9 | def cnn_model_02(num_classes):
10 |     # create model 02
11 |     model = Sequential()
12 |     model.add(Conv2D(32, (5, 5), input_shape=(1, 48, 48), activation='relu'))
13 |     model.add(MaxPooling2D(pool_size=(2, 2)))
14 |     model.add(Conv2D(64, (3, 3), activation='relu'))
15 |     model.add(MaxPooling2D(pool_size=(2, 2)))
16 |     model.add(Flatten())
17 |     model.add(Dense(1024, activation='relu'))
18 |     model.add(Dropout(0.5))
19 |     model.add(Dense(num_classes, activation='softmax', name='classes'))
20 |
21 |     # Compile model
22 |     model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
23 |     return model
This defines a 7-layer CNN model shown in figure 1.
Line 11, we create the sequential model, which is a linear stack of layers. The Keras documentation gives you all you need.
Lines 12 and 14, we define two 2D convolution layers to extract the features of the provided images. The first one has 32 filters and a 5×5 kernel; the other one has 64 filters and a 3×3 kernel. Both of them have a ‘relu’ activation.
You may find an interesting article about convnets here [intuitive-explanation-convnets].
Lines 13 and 15, we define two pooling layers; each one reduces the dimensionality of the feature map. The first pooling layer allows the model to fit into the memory available on our GPU.
Line 16, we flatten the output of the previous layer to feed the next dense layer.
Line 17, we use a fully connected layer, as in a traditional multilayer perceptron, with a ‘relu’ activation function.
Line 18, we randomly drop half of the outputs of the previous dense layer, which helps to prevent overfitting.
Line 19, this is the last fully connected layer, with a softmax activation function that turns the result into a probability for each class.
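As a quick sanity check, here is how this model can be instantiated and inspected (a usage sketch; 62 is the number of classes described above):

from h7as.models.keras_cnn_models import cnn_model_02

model = cnn_model_02(62)  # 62 classes: digits [0-9] and letters [a-zA-Z]
model.summary()           # prints each layer with its output shape and parameter count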
The next module, called data_mgmt.py, contains the functions to prepare and load the images and labels that feed our models.
The packages used in this module:
19 | import numpy as np         # fundamental package for scientific computing
20 | import os                  # operating system dependent functionality
21 | from scipy import ndimage  # various functions for multi-dimensional image processing
22 | from scipy import misc     # scipy miscellaneous routines - various utilities
23 | from h7as.utils.utils import *
24 |
The function load_data builds the lists of images and labels.
62 | # define load data function
63 | def load_data(input_folder="./img_chars", expand=True, train_test_ratio=.8):
64 |     X_img = []
65 |     y_label = []
66 |     X_img_fns = np.array([x for x in os.listdir(input_folder) if x.endswith(".png")])
67 |     # shuffle the images
68 |     np.random.shuffle(X_img_fns)
69 |     for img_fn in X_img_fns:
70 |         img = misc.imread(input_folder + '/' + img_fn)
71 |         X_img.append(img)
72 |         # convert char to int
73 |         label = get_alphabet_item(img_fn.split('.')[-2], 'i')
74 |         y_label.append(label)
75 |
76 |     if expand:
77 |         X_img, y_label = expand_training_data(X_img, y_label)
78 |
79 |     x_offset = int((len(X_img) * train_test_ratio))
80 |     X_train = np.asarray(X_img[:x_offset])
81 |     y_train = np.asarray(y_label[:x_offset])
82 |     X_test = np.asarray(X_img[x_offset:])
83 |     y_test = np.asarray(y_label[x_offset:])
84 |     return (X_train, y_train), (X_test, y_test)
This function takes three parameters: input_folder, expand, and train_test_ratio.
Line 68, we shuffle the array of filenames to improve the learning process and the overall accuracy of the model.
Lines 69-74, we build the array of labels, which are embedded in the file names. The way we generate those images can be found in the previous article, “Handwriting recognition – Seven: Dataset generation”.
Lines 76-77, we call the function that expands our training and validation sets. This enriches the character set and substantially improves the accuracy of the model, by about 15 to 20 percent.
Lines 79-84, we compute the offset between the training set and the validation set, split the array of images and the array of labels, then return them.
The function expand_training_data creates new images with skew and shift. It comes from hwalsuklee’s study on GitHub, with some adaptations. It needs to be improved, as it doesn’t shuffle the new data.
While writing this article, our research led us to a small nugget: the image preprocessing module of Keras. In the future, we will look into using this module, which offers functions of this kind and more, such as zoom and shear.
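As an illustration, here is a minimal sketch of that module; we have not integrated it yet, and the parameter values below are only plausible guesses:

from keras.preprocessing.image import ImageDataGenerator

# produces randomly rotated, shifted, sheared and zoomed batches on the fly
datagen = ImageDataGenerator(rotation_range=15,       # degrees, like expand_training_data
                             width_shift_range=0.05,  # fraction of the image width
                             height_shift_range=0.05,
                             shear_range=0.1,
                             zoom_range=0.1)
# datagen.flow(X_train, y_train, batch_size=50) could then feed model.fit_generator(...)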
26 | # Augment training data
27 | def expand_training_data(images, labels):
28 |     expanded_images = []
29 |     expanded_labels = []
30 |
31 |     j = 0  # counter
32 |     for x, y in zip(images, labels):
33 |         j = j + 1
34 |         if j % 1000 == 0:
35 |             print('expanding data : %03d / %03d' % (j, np.size(images, 0)))
36 |
37 |         # register original data
38 |         expanded_images.append(x)
39 |         expanded_labels.append(y)
40 |
41 |         # get a value for the background
42 |         # zero is the expected value, but median() is used to estimate background's value
43 |         bg_value = np.median(x)  # this is regarded as background's value
44 |         image = np.reshape(x, (-1, 48))
45 |
46 |         for i in range(2):
47 |             # rotate the image with random degree
48 |             angle = np.random.randint(-15, 15, 1)
49 |             new_img = ndimage.rotate(image, angle, reshape=False, cval=bg_value)
50 |
51 |             # shift the image with random distance
52 |             shift = np.random.randint(-2, 2, 2)
53 |             new_img_ = ndimage.shift(new_img, shift, cval=bg_value)
54 |
55 |             # register new training data
56 |             expanded_images.append(new_img_)
57 |             expanded_labels.append(y)
58 |
59 |     return expanded_images, expanded_labels
60 |
The third module, called h7as_char_cnn_train.py, is the main one.
usage: h7as_char_cnn_train.py [-h] -d DATA_DIR -l LOGS -m MODELS -n MODEL_NUMBER
                              [-ep EPOCHS] [-bs BATCH_SIZE] [-e EXPAND]
                              [-r RELOAD]
The parameters are:
- -d, --data_dir: path to the folder where the data are stored (required)
- -l, --logs: path to the folder where logs and checkpoints are saved (required)
- -m, --models: path to the folder where models are saved (required)
- -n, --model_number: model number (required)
- -ep, --epochs: number of epochs (default 20)
- -bs, --batch_size: batch size (default 50)
- -e, --expand: expand the training data with skew and shift (default True)
- -r, --reload: reload the previous best weights (default False)
We start by loading the packages and functions used.
23 | # import the necessary packages
24 | from argparse import ArgumentParser      # argument parser
25 | import numpy as np                        # fundamental package for scientific computing
26 | import os                                 # operating system dependent functionality
27 | os.environ['TF_CPP_MIN_LOG_LEVEL'] = '3'  # set the Tensorflow log level
28 | from keras.utils import np_utils
29 | from keras.callbacks import ModelCheckpoint
30 | from keras import backend as K
31 | import time
32 | from datetime import datetime
33 | import shutil
34 | from h7as.contrib.tensorresponseboard import TensorResponseBoard
35 | from h7as.models.keras_cnn_models import cnn_model_01, cnn_model_02
36 | from h7as.utils.utils import *
37 | from h7as.data.data_mgmt import load_data
Line 27, we set the variable TF_CPP_MIN_LOG_LEVEL before loading the Keras packages, to avoid unwanted logging from Tensorflow, the back end.
40 | K.set_image_dim_ordering('th')
41 |
42 | FLAGS = None
set_image_dim_ordering specifies which dimension ordering convention Keras will follow.
As you may find in the Keras documentation, "tf" assumes (rows, cols, channels) while "th" assumes (channels, rows, cols).
Below we start with the end of the script, where we build our list of arguments and call the function main with it.
159 | if __name__ == '__main__':
160 |     # construct the argument parse and parse the arguments
161 |     ap = ArgumentParser()
162 |     ap.add_argument("-d", "--data_dir", required=True,
163 |                     help="path to the folder where data are stored")
164 |     ap.add_argument("-l", "--logs", required=True,
165 |                     help="path to the folder to save logs and checkpoints")
166 |     ap.add_argument("-m", "--models", required=True,
167 |                     help="path to the folder to save models")
168 |     ap.add_argument("-n", "--model_number", required=True, type=int,
169 |                     help="model number")
170 |     ap.add_argument("-ep", "--epochs", type=int, default=20, help="number of epochs")
171 |     ap.add_argument("-bs", "--batch_size", type=int, default=50, help="batch size")
172 |     ap.add_argument("-e", "--expand", required=False, default=True, type=str2bool,
173 |                     help="expand the training data with skew and shift")
174 |     ap.add_argument("-r", "--reload", required=False, default=False, type=str2bool,
175 |                     help="reload previous best weights")
176 |
177 |     #args = vars(ap.parse_args())
178 |     FLAGS, unparsed = ap.parse_known_args()
179 |
180 |     FLAGS.logs = os.path.join(FLAGS.logs, 'model{:02d}'.format(FLAGS.model_number))
181 |     main(FLAGS)
Line 180, we insert the model number into the name of the logs folder.
Now let’s dive into the main procedure.
44 | # main
45 | def main(FLAGS):
46 |     # fix random seed for reproducibility
47 |     seed = 7
48 |     np.random.seed(seed)
49 |
50 |     print("Loading data..")
51 |     loading_start_time = time.time()
52 |     # load data
53 |     (X_train, y_train), (X_test, y_test) = load_data(input_folder=FLAGS.data_dir,
54 |                                                      train_test_ratio=.8)
55 |     print('Loading duration (s) : %.4f' % (time.time() - loading_start_time))
Lines 47-48, we fix the random seed to be able to reproduce the results between different runs.
Lines 53-54, we call our function to load and build the training and validation sets.
Then we normalize the data and one-hot encode the labels.
57 |     # reshape to be [samples][pixels][width][height]
58 |     X_train = X_train.reshape(X_train.shape[0], 1, 48, 48).astype('float32')
59 |     X_test = X_test.reshape(X_test.shape[0], 1, 48, 48).astype('float32')
60 |     # normalize inputs from 0-255 to 0-1
61 |     X_train = X_train / 255
62 |     X_test = X_test / 255
63 |     # one hot encode outputs
64 |     y_train = np_utils.to_categorical(y_train)
65 |     y_test = np_utils.to_categorical(y_test)
66 |     num_classes = y_test.shape[1]
67 |
Although we have been using Python for just six months, we are amazed by its simplicity and power.
The next step is to build the metadata and the sprite image for Tensorboard and its projector.
69 |     metadata_file = os.path.join(FLAGS.logs, 'h7as_cnn_metadata_4_classes.tsv')
70 |     if not os.path.isfile(metadata_file):
71 |         print("Saving metadata..")
72 |         meta_start_time = time.time()
73 |         y_size = min(10000, len(y_test))
74 |         # create metadata tsv file for tensorboard
75 |         save_metadata(categories=y_test[:y_size], log_dir=FLAGS.logs)
76 |         # print("X_test shape: ", X_test.shape)
77 |         create_sprite_4_classes(images=X_test[:y_size], labels=y_test[:y_size],
78 |                                 log_dir=FLAGS.logs)
79 |         print('Saving metadata duration (s) : %.4f' % (time.time() - meta_start_time))
80 |     else:
81 |         print("Metadata already exist.")
Line 73, we limit the sprite image to 10,000 thumbnails, that is, a grid of 100 by 100 thumbnails.
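Our helper create_sprite_4_classes is not reproduced here, but the idea behind a sprite is simple: tile the thumbnails into one big square image. Here is a minimal sketch of our own, assuming 48×48 images normalized to [0, 1]:

import numpy as np
from scipy import misc

def make_sprite(images, path, size=48):
    # tile n thumbnails into a square grid (10,000 thumbnails -> a 100 x 100 grid)
    n = len(images)
    grid = int(np.ceil(np.sqrt(n)))
    sprite = np.zeros((grid * size, grid * size), dtype=np.uint8)
    for k, img in enumerate(images):
        i, j = divmod(k, grid)
        sprite[i * size:(i + 1) * size, j * size:(j + 1) * size] = \
            (img.reshape(size, size) * 255).astype(np.uint8)
    misc.imsave(path, sprite)  # e.g. 'sprite_4_classes.jpg'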
Below we build the model corresponding to the chosen model number.
83 |     # build the model
84 |     if FLAGS.model_number == 1:
85 |         model = cnn_model_01(num_classes)
86 |     elif FLAGS.model_number == 2:
87 |         model = cnn_model_02(num_classes)
88 |     else:
89 |         print("Model number not found, building the first one...")
90 |         model = cnn_model_01(num_classes)
The argument of the cnn_model_nn functions is the number of characters in our classification, that is, 62 ([0-9a-zA-Z]).
102 |     if FLAGS.reload == True:
103 |         # load best weights
104 |         print("Loading best weights...")
105 |         filepath = FLAGS.logs + "/h7as_handwriting_cnn_model" + \
106 |             '{:03d}'.format(FLAGS.model_number) + "_weights_best.hdf5"
107 |         filepath_bck = FLAGS.logs + \
108 |             "/h7as_handwriting_cnn_model" + '{:03d}'.format(FLAGS.model_number) + \
109 |             "_weights_best" + datetime.now().strftime("%Y%m%d%H%M") + ".hdf5"
110 |         shutil.copyfile(filepath, filepath_bck)
111 |         model.load_weights(filepath)
112 |         # Compile model (required to make predictions)
113 |         model.compile(loss='categorical_crossentropy', optimizer='adam',
114 |                       metrics=['accuracy'])
115 |         print("Created model and loaded weights from previous run")
The line numbering differs here because the current script already contains many more models.
Lines 111-114, we load the weights and recompile the model.
117 |     # define the checkpoint
118 |     # filepath = FLAGS.logs + "/h7as_handwriting_cnn-{epoch:03d}-{loss:.4f}-best.hdf5"
119 |     filepath = FLAGS.logs + "/h7as_handwriting_cnn_model" + \
120 |         '{:03d}'.format(FLAGS.model_number) + "_weights_best.hdf5"
121 |     checkpoint = ModelCheckpoint(filepath, monitor='val_acc', verbose=0,
122 |                                  save_best_only=True, mode='max')
This callback is a classic one. You may find more information in the Keras documentation.
The next one is more complex: when we wrote this script, Keras did not provide all the parameters needed by the Tensorboard projector.
124 |     # define the projector for Tensorboard
125 |     tbCallBack = TensorResponseBoard(log_dir=FLAGS.logs, histogram_freq=0, batch_size=200,
126 |                                      write_graph=True, write_grads=False, write_images=False,
127 |                                      embeddings_freq=5, embeddings_layer_names=['classes'],
128 |                                      embeddings_metadata='h7as_cnn_metadata_4_classes.tsv',
129 |                                      val_size=len(X_test), img_path='sprite_4_classes.jpg',
130 |                                      img_size=[48, 48])
131 |
132 |     callbacks_list = [checkpoint, tbCallBack]
The TensorResponseBoard class was written by Yu-Yang. You can find the code of this module on stackoverflow.com.
Now we just have to call the fit function to train our model; this takes some time.
134 |     print("Training model...")
135 |     training_start_time = time.time()
136 |
137 |     # Fit the model
138 |     history = model.fit(X_train, y_train, validation_data=(X_test, y_test),
139 |                         epochs=FLAGS.epochs, batch_size=FLAGS.batch_size,
140 |                         callbacks=callbacks_list, verbose=2)
141 |     print('Training duration (s) : %.02d' % (time.time() - training_start_time))
142 |
143 |     # summarize performance of the model
144 |     print("Evaluating model...")
145 |     Evaluation_start_time = time.time()
146 |
147 |     # Final evaluation of the model
148 |     scores = model.evaluate(X_test, y_test, verbose=0)
149 |     print("CNN Model " + '{:03d}'.format(FLAGS.model_number) + \
150 |           " Error: %.2f%%" % (100 - scores[1] * 100))
151 |     print('Evaluation duration (s) : %.4f' % (time.time() - Evaluation_start_time))
152 |
153 |     # serialize model to JSON
154 |     print("Saving model to disk...")
155 |     model_json = model.to_json()
156 |     with open(FLAGS.models + "/h7as_handwriting_cnn_model" + \
157 |               '{:03d}'.format(FLAGS.model_number) + ".json", "w") as json_file:
158 |         json_file.write(model_json)
Lines 138-140, we call the fit function with all our parameters.
Lines 155-158, we save our model in JSON format.
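For later inference, the saved architecture and the best checkpointed weights can be reloaded. Here is a minimal sketch, assuming the script was run with -m models -l logs -n 2 so the file names follow the patterns above:

from keras.models import model_from_json

# rebuild the architecture from the JSON file...
with open("models/h7as_handwriting_cnn_model002.json") as json_file:
    model = model_from_json(json_file.read())
# ...then load the best weights saved by the ModelCheckpoint callback
model.load_weights("logs/model02/h7as_handwriting_cnn_model002_weights_best.hdf5")
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])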
Depending on the number of epochs, the batch size, and the sizes of the training and validation sets, you get an accuracy above 99 percent.
# create model 02
# Epoch 30/30
# 132s - loss: 0.0296 - acc: 0.9932 - val_loss: 0.0080 - val_acc: 0.9981
# CNN Model 002 Error: 0.19%
# create model 02
# Epoch 29/35 - Batch_size = 50
# 164s - loss: 0.0320 - acc: 0.9927 - val_loss: 0.0017 - val_acc: 0.9999
# CNN Model 002 Error: 0.12%
So far, we have some models that have been trained and evaluated on one dataset; next we will test these models against other test datasets.
We will show this in a future article.
[1] – Théodore Bluche. Deep Neural Networks for Large Vocabulary Handwritten Text Recognition. Computers and Society [cs.CY]. Université Paris Sud – Paris XI, 2015. English. <NNT : 2015PA112062>.
[2] – Ciresan, D. C., Meier, U., Gambardella, L. M., & Schmidhuber, J. (2010). Deep big simple neural nets for handwritten digit recognition. Neural computation, 22 (12), 3207–3220.
[3] – Thomas, S., Chatelain, C., Paquet, T., & Heutte, L. (2013). Un modèle neuro markovien profond pour l’extraction de séquences dans des documents manuscrits. Document numérique, 16 (2), 49–68.
[4] – LeCun, Y., Boser, B., Denker, J. S., Henderson, D., Howard, R. E., Hubbard, W., & Jackel, L. D. (1989). Backpropagation applied to handwritten zip code recognition. Neural computation, 1 (4), 541–551.
[5] – Matthew D. Zeiler, Rob Fergus. Visualizing and Understanding Convolutional Networks, arXiv:1311.2901v3 [cs.CV] 28 Nov 2013.
[6] – MNIST database – http://yann.lecun.com/exdb/mnist/