IT Engineering
With the tilt and orientation corrections done, we can now segment the page to extract the lines, then the words. To separate the letters, we will also have to detect the inclination of the handwriting.
The document may present deformations; however, we will postpone that correction until later, if it proves necessary.
In the previous work on correcting the slope and orientation, we looked for the lines and calculated the average leading. We will improve that process to obtain homogeneous lines, and also calculate the characteristics of the handwriting: mean height, cap height, ascender height and descender height.
The civil status registers, and more particularly the decennial tables, are far from having an identical layout.
Here are two examples from the departmental archives of “Nord” and “Yonne” [France].
The processing program must be able to handle these different layouts and extract the lines. Since the documents available on the Internet do not have enough resolution unless they are enlarged a little before taking a screenshot, we work with parts of these pages.
This also simplifies the processing to be performed since we can extract the proper part of the document.
So we will deal with excerpts of documents like the ones shown below, with and without header.
From these documents we could directly locate the lines to extract, and calculate the leading and the height of the letters. However, various register pages have a layout with very marked columns that pollute the signal. We will therefore start by locating the columns, so as to remove these vertical lines and search for the horizontal ones only on a cleaner part of the image.
The vertical projection profile and the cspline2d, sepfir2d and argrelextrema functions are used to locate the darker and lighter areas of the document.
Here is the function used to locate the darkest vertical lines (the columns) of the image.
 1  # function look up for columns
 2  def columns_look_up(img):
 3      # the image must be a gray one (no channel)
 4      # transpose and flip the image to use the cspline2d function
 5      img_transpose = np.zeros((img.shape[1], img.shape[0]), np.uint8)
 6      cv2.transpose(img, img_transpose)
 7      img_flipped = cv2.flip(img_transpose, 1)
 8
 9      # find the darkest rows on the page (vertical)
10      imgck = signal.cspline2d(img_flipped, 8.0)
11      peaks_hollows = imgck.sum(axis=1)
12      minInd = signal.argrelextrema(peaks_hollows, np.less, order=20)
13
14      # compute the mean and stddev
15      valleys = []
16      for val in minInd[0]:
17          valleys.append(peaks_hollows[val])
18
19      stddev_hollows = np.std(valleys)
20      mean_hollows = np.sum(valleys) / len(minInd[0])
21
22      # keep only the deepest valleys
23      lower_limit_hollows = mean_hollows - stddev_hollows
24
25      # find the columns on the page
26      cols = []
27      for val in minInd[0]:
28          if peaks_hollows[val] <= lower_limit_hollows:
29              cols.append(val)
30      return cols
Lines 5-7, we prepare the image for the signal.cspline2d function.
Lines 10-12, we look for the minima (hollows, i.e. the darkest points) on the row sums of the result returned by the function.
Line 12, we assign the value 20 to the order parameter, so that 20 points on each side of the point being studied are taken into account when keeping a minimum. We could increase this value up to 60 to exclude certain parasitic minima; in that case we should use the mean, and not the mean minus the standard deviation, to select the points. Yet it would not change the result and would only handle a particular case. We just need enough columns to be able to exclude the column of dates without further computation.
Lines 15-20, we compute the mean and standard deviation of the hollows found.
Lines 23-28, we keep only the deepest hollows.
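The function above relies on OpenCV, NumPy and scipy.signal. A minimal, hypothetical call could look like this (the file name is only an example):

import cv2
import numpy as np
from scipy import signal

# load a page extract as a grayscale image (single channel); file name for illustration only
img = cv2.imread("decennial_table_extract.png", cv2.IMREAD_GRAYSCALE)

# x positions (in pixels) of the dark vertical lines detected on the page
cols = columns_look_up(img)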
In the case presented above (two columns found), we will extract the right part of the image and look for the lines there. It contains the surnames, the first names and the date of the event. In cases where the central columns are more marked, we will extract only the surnames, or the surnames and first names. We try to avoid extracting the column of dates, which may contain significant overwriting; this is the case for the registers of the period 1792-1812, where the dates are given both in the Gregorian calendar and in the revolutionary calendar.
Depending on the number of columns detected, we will carry out the search on a more or less wide area of the image, favoring its left part.
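The exact rule depends on the number and position of the columns found; purely as an illustration (the choice of the right-most column and the fallback to the full width are assumptions, not necessarily what is done here), the selection of the working area could be sketched as follows:

def working_area(img, cols):
    # hypothetical rule: analyse everything to the left of the right-most
    # detected column (assumed to border the date column); when no column
    # is detected, keep the whole width of the extract
    right = int(cols[-1]) if len(cols) > 0 else img.shape[1]
    return img[:, :right]

area = working_area(img, cols)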
The graphs below show different measurements obtained with the signal cspline2d, sepfir2d and argrelextrema functions.
All of these graphs give the variations between the dark and light rows of the image. From the extreme points we can locate the baseline, the cap line, the x-height line and the descender line.
The graph at the top gives us the darkest rows (red dots), and the lightest ones (blue dots) for each possible text line. We are looking for the baseline, in other words a point somewhere between these blue and red dots, while removing the parasitic lines.
The graph at the bottom gives us something approaching the baselines (red dots) for each possible line. Nevertheless, there are too many parasitic lines.
The vertical projection of the extracted area is the one shown in the first graph above.
The red points give us the darkest part of each line, which is, in most cases, the center of the text but not the baseline. Here the first point must be omitted.
The algorithm that we implemented works on the points of graph no. 1: we try to minimize the variance of the distance between the lines kept.
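The criterion is not spelled out in code here; one possible reading of “minimizing the variance of the distance”, sketched purely as an illustration (the greedy removal and the threshold are assumptions), is the following:

import numpy as np

def filter_rows_by_spacing(rows, max_rel_spread=0.25):
    # rows: candidate y positions of text lines (e.g. indices returned by argrelextrema)
    # greedily drop the candidate whose removal best reduces the spread of the
    # inter-line distances, until the spacing is roughly uniform
    rows = sorted(int(r) for r in rows)

    def rel_spread(r):
        gaps = np.diff(r)
        return np.std(gaps) / np.mean(gaps)

    while len(rows) > 3 and rel_spread(rows) > max_rel_spread:
        candidates = [rows[:i] + rows[i + 1:] for i in range(len(rows))]
        best = min(candidates, key=rel_spread)
        if rel_spread(best) >= rel_spread(rows):
            break
        rows = best
    return rows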
Here is the code to find the different intensity variations.
 1  # compute cspline2d and sepfir2d
 2  sum_rows = imggrayed.sum(axis=1)
 3  imgck = signal.cspline2d(imggrayed, 8.0)
 4  derfilt = np.array([1.0, -2, 1.0], dtype=np.float32)
 5  imgfderiv = (signal.sepfir2d(imgck, derfilt, [1]) + signal.sepfir2d(imgck, [1], derfilt))
 6  imgsderiv = (signal.sepfir2d(imgfderiv, derfilt, [1]) + signal.sepfir2d(imgfderiv, [1], derfilt))
 7
 8  peaks_hollows = imgck.sum(axis=1)
 9  maxInd_order = 16
10  maxInd = signal.argrelextrema(peaks_hollows, np.greater, order=maxInd_order)
11  # here we don't get the last line - is there a bug ?
12  minInd_order = 20
13  minInd = signal.argrelextrema(peaks_hollows, np.less, order=minInd_order)
14
15  peaks_hollows_fd = imgfderiv.sum(axis=1)
16  maxInd_fd_order = 14
17  maxInd_fd = signal.argrelextrema(peaks_hollows_fd, np.greater, order=maxInd_fd_order)
18  minInd_fd_order = 10
19  minInd_fd = signal.argrelextrema(peaks_hollows_fd, np.less, order=minInd_fd_order)
20
21  #-------------------------------------------------------------
22  # to get the last line missing - minInd = signal.argrelextrema
23  valleys = []
24  for val in minInd[0]:
25      valleys.append(peaks_hollows[val])
26  stddev_hollows = np.std(valleys)
27  mean_hollows = np.sum(valleys) / len(minInd[0])
28  lower_limit_hollows = mean_hollows + (stddev_hollows * 1.2)
29
30  if maxInd_fd[0][-1] > (minInd[0][-1] * 1.1) and\
31     peaks_hollows[maxInd_fd[0][-1]] <= lower_limit_hollows:
32      minInd = (np.append(minInd[0], maxInd_fd[0][-1]),)
33
34  #----------End correction--------------------------------------
35
36
37  # for a part of a page - half page or less
38  peaks_hollows_sd = imgsderiv.sum(axis=1)
39  maxInd_sd_order = 10
40  maxInd_sd = signal.argrelextrema(peaks_hollows_sd, np.greater, order=maxInd_sd_order)
41  minInd_sd_order = 10
42  minInd_sd = signal.argrelextrema(peaks_hollows_sd, np.less, order=minInd_sd_order)
Lines 2-6, we start by computing the vertical projection profile (histogram) of the grayed image, then the cspline2d signal and its “derivatives”.
Lines 8-13, we compute the sum of the cspline2d signal, giving the peaks and valleys of the curve.
Lines 9-10, we look for the peaks with an order of 16 (comparison with 16 points on each side of a point).
Lines 12-13, we look for the valleys/hollows with an order of 20 (comparison with 20 points on each side of a point).
We repeat the operation with the “derivatives” of the signal.
Lines 15-19, the computation is done for the first “derivative”.
Lines 38-42, the computation is done for the second “derivative”.
It seems that there is an anomaly with signal.argrelextrema and np.greater: regardless of the value of the order parameter, we never get the last extremum. This is most likely not a bug but the boundary behaviour of the function: with the default mode='clip', a point at the very edge of the array is compared with itself, so the strict np.greater test always fails there. We therefore apply a small correction for this (lines 21-35).
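A toy example shows this boundary behaviour (the values are chosen only for illustration):

import numpy as np
from scipy import signal

a = np.array([1, 3, 1, 2, 1, 5])
print(signal.argrelextrema(a, np.greater, order=1))
# prints (array([1, 3]),): the last value (index 5) is higher than its neighbour,
# but it sits on the boundary and is therefore never reported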
If we draw the lines found by the various functions mentioned above, we obtain the following results.
We first go through the list of red lines from the left graph and finally select as baselines the nearest red lines below the writing in the middle graph.
As descender lines, we choose the first green lines below the baselines in the third graph.
As ascender lines, we choose the first green lines above the baselines in the first graph. In this case, these are not the real cap lines but a height which allows us to extract all the characters despite the variation from line to line.
We crop each line between the white line and the cyan one, shown by the red arrow on the graph below.
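Putting the selection of these boundary lines and the cropping together, a sketch could look like this (ascender_candidates, descender_candidates and baseline are hypothetical placeholders for the extrema taken from the graphs above; imggrayed is the grayscale extract used in the previous listing):

def nearest_above(candidates, y):
    # closest candidate row strictly above position y (smaller row index), or None
    above = [int(c) for c in candidates if c < y]
    return max(above) if above else None

def nearest_below(candidates, y):
    # closest candidate row strictly below position y (larger row index), or None
    below = [int(c) for c in candidates if c > y]
    return min(below) if below else None

# crop one text line between its ascender line and its descender line
top = nearest_above(ascender_candidates, baseline)
bottom = nearest_below(descender_candidates, baseline)
line_img = imggrayed[top:bottom + 1, :]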
A whole line is like this one.
The next step is to extract the words and slide through them with a window of a suitable size to extract the letters one by one and submit them to our character recognition models. Before that, however, it is necessary to calculate the inclination of the handwriting, the width of each character and the distances needed to cut the words, and then the letters, as precisely as possible.
We will study these aspects in our next articles.
[1] Scipy.org, Release 0.17.0 (2016), Signal processing.
[2] Archives départementales du Nord, France – http://www.archivesdepartementales.lenord.fr
[3] Archives départementales de l’Yonne, France – http://archivesenligne.yonne-archives.fr