
Page segmentation

With the tilt and orientation corrected, we can segment the page to extract the lines, then the words. To separate the letters, we will also have to detect the slant of the handwriting.

The document may be deformed, but we will postpone that correction until later, if it proves necessary.

In the previous work on correcting slope and orientation, we looked for lines and calculated the average leading. We will now improve that process to obtain homogeneous lines, and also calculate the characteristics of the handwriting: mean height, cap height, ascender height and descender height.

 

Grantown on Spey, Scotland, 1995

The civil status registers, and the decennial tables in particular, are far from having an identical layout.

Here are two examples from the departmental archives of “Nord” and “Yonne” [France].

Figure 1 – Status register samples

The processing program must be able to handle these different layouts and extract the lines. Because the documents available on the Internet do not have enough resolution unless you enlarge them a bit before taking a screenshot, we captured only parts of these pages.

This also simplifies the processing to be performed since we can extract the proper part of the document.

So we will deal with excerpts of documents like the ones shown below, with and without header.

Figure 2 – Part of a page without header

Figure 3 – The same specimen with header

From these documents we could directly locate the lines to extract and calculate the leading and the height of the letters. However, many copies of register pages have a layout with strongly marked columns that pollute the signal. We will therefore start by locating the columns, so that we can leave out these vertical lines and search for the horizontal text lines only in a cleaner part of the image.

Locating columns

The vertical projection profile and the SciPy signal functions cspline2d, sepfir2d and argrelextrema are used to locate the darker and lighter areas of the document.

  • The vertical projection profile is the sum of the pixel values of the grayscale image taken along the y axis.
  • signal.cspline2d returns the third-order B-spline coefficients over a regularly spaced input grid for the two-dimensional input image.
  • signal.sepfir2d convolves the rank-2 input array with the separable filter defined by the rank-1 arrays hrow and hcol.
  • signal.argrelextrema calculates the relative extrema of the data.

Here is the function used to locate the darkest lines of the image.

Lines 5-7: we prepare the image for the signal.cspline2d function.

Lines 10-12: we look for the minima (the hollows, or darkest points) in the sum of the result returned by the function.

Line 12: we assign the value 20 to the variable ‘order’, so that each point is compared with the 20 points on either side of it and kept only if it is the minimum. We could increase this value up to 60 to exclude certain parasitic minima. In that case, we should select the points with the mean rather than with the mean minus the variance. However, that would not change the result here and would only address one particular case. We just need enough columns to be able to exclude the column of dates without further computation.

Lines 15-20: we compute the mean and variance of the hollows found.

Lines 23-28: we keep only the deepest hollows.

Figure 3 – Page segmentation locate columns

In the case presented above (two columns found), we extract the right part of the image and look for the lines there. It contains the surnames, the first names and the date of the event. When the central columns are more marked, we extract only the surnames, or the surnames and first names. We try to avoid extracting the column of dates, which may be heavily overloaded. This is the case in the registers of the period 1792-1812, where the dates are given in both the Gregorian calendar and the revolutionary calendar.

Computing leading and locating lines

Depending on the number of columns detected, we carry out the search on a wider or narrower area of the image, favoring its left part.

The graphs below show different measurements obtained with the signal.cspline2d, sepfir2d and argrelextrema functions.

Figure 4 – Page segmentation locate lines

All of these graphs show the variations between dark and light rows of the image. From the extrema we can locate the baseline, the cap line, the x-height line and the descender line.

The graph at the top gives the darkest rows (red dots) and the lightest ones (blue dots) for each possible text line. We are looking for the baseline, in other words a point somewhere between these blue and red dots, while removing the parasitic lines.

The graph at the bottom gives something close to the baselines (red dots) for each possible line. Nevertheless, there are too many parasitic lines.

The vertical projection of the extracted area (the first graph above) is shown below.

Figure 4 – Page segmentation locate lines

The red points give the darkest part of each line, which in most cases is the center of the text and not the baseline. Here the first point must be omitted.

The algorithm we implemented, which tries to minimize the variance of the distances, works on the points of graph no. 1 as follows:

  • Calculate the average distance between the minima, starting from the first point or from a point 20% of the way into the list of values (to skip headers, and to handle documents with few lines).
  • Select the points whose distance to their successor lies within the calculated average +/- a margin of error.
  • Calculate a first line spacing from the selected points.
  • Check the points through to the last point, excluding those whose distance to their successor is not within the calculated average +/- the margin of error.
  • Select the points from the starting point back to the beginning of the list, excluding those whose distance to their successor is not within the calculated average +/- the margin of error and those that are too close to the top of the picture.
  • For each selected point, take the nearest greater point in the list of minima of the third graph (a better approximation of the baseline), provided their distance is within the computed leading and a margin.
  • Search for the cap height among the points that mark a lighter row in the image before the body of the line.
  • Search for the descender height among the points that mark a lighter row in the image after the baseline.
  • For each baseline, crop the text and save a slice of the image.
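The first steps of this list (estimating the leading and keeping only regularly spaced minima) can be sketched as below. This is a simplified reading of the procedure, not the original implementation: the function name regular_lines, the 25% margin and the sample row positions are assumptions, and the forward pass simply skips minima that fall too close to the previously kept one.

```python
import numpy as np


def regular_lines(minima, skip_ratio=0.2, margin=0.25):
    """Estimate the leading and keep the minima that follow it.

    `minima` holds the row positions of the dark minima.
    Returns (leading, kept_rows).
    """
    rows = np.sort(np.asarray(minima, dtype=float))
    start = int(len(rows) * skip_ratio)      # skip a possible header
    gaps = np.diff(rows[start:])
    if gaps.size == 0:
        raise ValueError("need at least two minima after the skipped part")

    # First guess, then refinement using only the gaps close to it.
    guess = gaps.mean()
    close = np.abs(gaps - guess) <= margin * guess
    leading = gaps[close].mean() if close.any() else guess
    low, high = leading * (1 - margin), leading * (1 + margin)

    # Forward pass: drop parasitic minima that are too close.
    kept = [rows[start]]
    for row in rows[start + 1:]:
        if row - kept[-1] >= low:
            kept.append(row)

    # Backward pass towards the top of the picture: stop at the
    # first point that does not fit the leading.
    for row in rows[:start][::-1]:
        if low <= kept[0] - row <= high:
            kept.insert(0, row)
        else:
            break
    return leading, kept


# Hypothetical minima: 12 is a header, 275 a parasitic minimum.
leading, kept = regular_lines([12, 100, 140, 181, 219, 260, 275, 300, 340])
```

With these sample values the header at row 12 and the parasitic minimum at row 275 are both discarded, and the remaining rows are spaced by roughly one leading.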

Here is the code that finds the different intensity variations.

Lines 2-6: we first compute the vertical projection profile (the histogram) of the grayscale image, then the cspline2d signal and its “derivatives”.

Lines 8-13: we compute the sum of the cspline2d signal, which gives the peaks and valleys of the curve.

Lines 9-10: we look for the peaks with an order of 16 (each point is compared with the 16 points on either side of it).

Lines 12-13: we look for the valleys (hollows) with an order of 20 (each point is compared with the 20 points on either side of it).

We repeat the operation with the “derivatives” of the signal.

Lines 15-19: the computation is done for the first “derivative”.

Lines 38-42: the computation is done for the second “derivative”.
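Since the listing itself is not shown, here is a minimal sketch of what such a computation might look like. The function name row_extrema, the smoothing value 8.0 and the two difference kernels used for the “derivatives” are assumptions; only the use of cspline2d, sepfir2d and argrelextrema with orders 16 and 20 comes from the description above.

```python
import numpy as np
from scipy import signal


def row_extrema(gray, smoothing=8.0, peak_order=16, valley_order=20):
    """Return the peaks and valleys of the row profiles of an image.

    The result maps "signal", "first" and "second" to a pair
    (peak_rows, valley_rows).
    """
    image = np.asarray(gray, dtype=np.float64)
    coeffs = signal.cspline2d(image, smoothing)

    # Assumed kernels: central first difference and second difference,
    # applied down the columns (hcol), i.e. across the text lines.
    identity = np.array([1.0])
    first_kernel = np.array([-1.0, 0.0, 1.0])
    second_kernel = np.array([1.0, -2.0, 1.0])
    first = signal.sepfir2d(coeffs, identity, first_kernel)
    second = signal.sepfir2d(coeffs, identity, second_kernel)

    result = {}
    for name, data in (("signal", coeffs), ("first", first), ("second", second)):
        # One value per row of the image.
        profile = data.sum(axis=1)
        peaks = signal.argrelextrema(profile, np.greater, order=peak_order)[0]
        valleys = signal.argrelextrema(profile, np.less, order=valley_order)[0]
        result[name] = (peaks, valleys)
    return result


# Synthetic page: three dark text bands centred on rows 40, 100 and 160.
page = np.full((200, 300), 255.0)
for center in (40, 100, 160):
    page[center - 2:center + 3, :] = 0.0
extrema = row_extrema(page)
```

The valleys of the "signal" profile correspond to the darkest rows; the extrema of the two filtered profiles mark the transitions between ink and background.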

There seems to be an anomaly in signal.argrelextrema used with np.greater: whatever the value of the variable ‘order’, we do not get the last value. We therefore apply a small correction for this (lines 21-35).
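One likely explanation, offered here as a reading of SciPy's behaviour rather than of the original code: with the default mode='clip', indices past the end of the array are clipped to the last sample, so that sample ends up compared with itself and can never be strictly greater. The sketch below reproduces this on a toy array and shows one possible workaround, padding both ends with -inf. The helper name argrelmax_with_edges is hypothetical, and this is not the correction used in the original listing.

```python
import numpy as np
from scipy.signal import argrelextrema


def argrelmax_with_edges(data, order=1):
    """Relative maxima of a 1-D array, end points included."""
    data = np.asarray(data, dtype=float)
    # Pad with -inf so that an end point can win the strict comparison.
    padded = np.concatenate(([-np.inf], data, [-np.inf]))
    return argrelextrema(padded, np.greater, order=order)[0] - 1


# The largest value is the last sample.
profile = np.array([1.0, 3.0, 2.0, 5.0])

strict = argrelextrema(profile, np.greater)[0]   # last sample is missed
with_edges = argrelmax_with_edges(profile)       # last sample is reported
```

Passing np.greater_equal instead of np.greater is another option, at the cost of also reporting every point of a flat plateau.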

If we draw the lines found by the various functions mentioned above, we obtain the following results.

Figure 5 – Page segmentation functions visualization

We first go through the list of red lines from the left graph and, at the end, select as baselines the nearest red lines below the writing in the middle graph.

As descender lines, we choose the first green lines below the baselines in the third graph.

As ascender lines, we choose the first green lines above the baselines in the first graph. These are not the real cap lines, but a height that allows all the characters to be extracted despite the variation from line to line.

We crop each line between the white line and the cyan one, as indicated by the red arrow on the graph below.

Figure 6 – Page segmentation line extraction
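As an illustration of this selection and cropping step, here is a minimal sketch. It assumes the baselines and the light rows (the green lines) are already available as row indices; the names line_bounds and crop_lines and the sample values are hypothetical.

```python
import numpy as np


def line_bounds(baselines, light_above, light_below):
    """For each baseline, pick the nearest light row above it and the
    first light row below it. Baselines without both are skipped."""
    above = np.sort(np.asarray(light_above))
    below = np.sort(np.asarray(light_below))
    bounds = []
    for base in baselines:
        i = np.searchsorted(above, base, side="left") - 1   # last row < base
        j = np.searchsorted(below, base, side="right")      # first row > base
        if i >= 0 and j < len(below):
            bounds.append((int(above[i]), int(below[j])))
    return bounds


def crop_lines(gray, bounds):
    """Cut one horizontal slice of the image per (top, bottom) pair."""
    image = np.asarray(gray)
    return [image[top:bottom, :] for top, bottom in bounds if bottom > top]


# Hypothetical row indices on a blank 140-row page.
page = np.zeros((140, 60))
bounds = line_bounds([40, 80], light_above=[10, 50, 90], light_below=[47, 88, 130])
slices = crop_lines(page, bounds)
```

Each slice can then be saved as its own image for the word and letter extraction that follows.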

A whole extracted line looks like this.

Figure 8 – Page segmentation extracted line sample

The next step is to extract the words and slide a window of a suitable size across them to extract the letters one by one and submit them to our character recognition models. First, however, we need to calculate the slant of the handwriting, the width of each character and the distances needed to cut out the words, then the letters, as precisely as possible.

We will study these aspects in the next articles.

References

[1] – Scipy.org, Release: 0.17.0 (2016) Signal processing.

Resources

[1] Archives départementales du Nord – France – http://www.archivesdepartementales.lenord.fr

[2] Archives départementales de l’Yonne – France – http://archivesenligne.yonne-archives.fr 
