
Line segmentation

We now have the individual lines of a register page. The next step in decomposing the handwritten information is to cut each line into units small enough to submit to the character recognition models.

Segmenting a page into lines is relatively simple, because the leading is discernible. Splitting a line into coherent words or groups of words is more laborious.

The main reasons are:

  • the quality of the medium (contrast, noise, defects, etc.),
  • the overall layout of the page (columns),
  • the distribution of text on the line (hyphenation),
  • the quality of the handwriting (neat, legible, well formed or not).

 


Guillemots (Uria), Catterline, Scotland, 1995

Here are some examples stacked in a single image.


Figure 1 – Several lines taken from registers

For the first line, the words are well distributed across the width. It is not difficult to extract them, even if the contrast and the columns are not very pronounced.

The second line presents a difficulty for the date part, where several dates are noted one above the other.

The third has two particularities: the line begins with a capital D, and the first names that follow are poorly spaced and written in abbreviated form.

In the fourth and fifth, the difficulty lies in the words (first names) being joined to one another.

Of the five lines shown, four are written with a slant to the right of about 45°. We also found registers in which the slant differs between the surname and the rest of the entry: first names, annotations and dates.

Straightforward approach

At first, we will not worry about the inclination of the handwriting. We will search for spaces between words or groups of words. The date will be extracted in its entirety, without taking into account any columns.

From the line below we get four slices, shown stacked in Figure 3.


Figure 2 – Line sample


Figure 3 – Four stacked slices
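This first pass can be sketched as follows. It is only a minimal illustration, assuming a grayscale line image with dark ink on a light background; the name split_on_gaps and the parameters ink_threshold and min_gap are hypothetical and are not taken from our scripts. The idea is to count ink pixels per column and to cut wherever a long enough run of empty columns appears.

```python
import numpy as np


def split_on_gaps(line_img, ink_threshold=128, min_gap=8):
    """Return (start, end) column ranges of the ink groups found in a line.

    Assumptions (not from the article): pixels darker than `ink_threshold`
    are ink, and a separation is any run of at least `min_gap` columns
    that contains no ink at all. `end` is exclusive.
    """
    has_ink = (np.asarray(line_img) < ink_threshold).sum(axis=0) > 0

    slices, start, blank_run = [], None, 0
    for x, inked in enumerate(has_ink):
        if inked:
            if start is None:
                start = x          # a new group begins here
            blank_run = 0
        elif start is not None:
            blank_run += 1
            if blank_run >= min_gap:
                # close the group just after its last inked column
                slices.append((start, x - blank_run + 1))
                start, blank_run = None, 0
    if start is not None:
        # the line ended inside a group; drop any trailing blank columns
        slices.append((start, len(has_ink) - blank_run))
    return slices
```

Each returned pair can be used directly to crop a slice, for example line_img[:, start:end]. A fixed min_gap is a simplification: on real registers it would probably need to depend on the height of the writing.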

We can then go back over each part, on the one hand to measure the slant and other features, and on the other to try to extract fragments of words, letters or digits.

The method used in the first phase of the segmentation is similar to the one used for pages. The SciPy functions cspline2d, sepfir2d and argrelextrema (all in scipy.signal) allow us to locate the points of interest for the division.

The example shows green and red lines derived from the so-called “second derivative”: red points mark the darkest columns (red lines) and blue points the lightest ones (green lines), for a given value of the order parameter of argrelextrema.


Figure 4 – Curves and points of interest
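To make the role of the three functions more concrete, here is a hedged sketch of how they can be combined. It follows the second-derivative example of the SciPy signal tutorial rather than our exact scripts: the function name column_extrema, the smoothing value and the order value are assumptions chosen for illustration, and the curves in Figure 4 may have been produced with different settings.

```python
import numpy as np
from scipy.signal import argrelextrema, cspline2d, sepfir2d


def column_extrema(line_img, smoothing=2.0, order=5):
    """Return the second-derivative column profile and its local extrema.

    Assumptions (illustrative only): dark ink on a light background,
    `smoothing` is the lambda of the smoothing spline and `order` is the
    number of neighbours argrelextrema compares on each side.
    """
    img = np.asarray(line_img, dtype=np.float64)
    coeffs = cspline2d(img, smoothing)            # smoothing-spline coefficients

    second = np.array([1.0, -2.0, 1.0])           # discrete second derivative
    unit = np.array([1.0])                        # leaves the other axis untouched
    # Sum of the second derivatives along both axes (a Laplacian),
    # as in the SciPy tutorial.
    deriv = sepfir2d(coeffs, second, unit) + sepfir2d(coeffs, unit, second)

    profile = deriv.sum(axis=0)                   # one value per column
    # A dark stroke is an intensity valley, so its second derivative is a peak.
    darkest = argrelextrema(profile, np.greater, order=order)[0]
    lightest = argrelextrema(profile, np.less, order=order)[0]
    return profile, darkest, lightest
```

A larger order keeps only the most prominent points, a smaller one returns many more candidates. On flat background the profile is close to zero, so in practice the extrema would also need to be filtered by their amplitude.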

We first try to detect columns, then recompute the curves for each part to determine where the gaps are. The distances between points help us distinguish a space between characters from a space between words, the latter usually being larger. However, there are always exceptions.
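One simple way to express this distinction, given as an assumption and not as the rule we actually apply, is to compare each gap with the median gap of the line. The name classify_gaps and the ratio value below are hypothetical.

```python
import numpy as np


def classify_gaps(gaps, ratio=1.8):
    """Label each (start, end) blank run as a 'word' or a 'char' gap.

    Assumption (illustrative only): a gap between words is wider than
    `ratio` times the median gap width measured on the same line.
    """
    if not gaps:
        return []
    widths = np.array([end - start for start, end in gaps], dtype=float)
    limit = ratio * np.median(widths)
    return ["word" if width > limit else "char" for width in widths]
```

With gaps of 3, 4, 12, 3 and 15 pixels, the median is 4 and the limit 7.2, so only the 12 and 15 pixel gaps are labelled as word gaps. A tight space between two words, as in the exceptions mentioned above, would be mislabelled by such a rule.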

In the following example, our method does not produce an optimal division.


Figure 5 – Some awkward points

The picture above shows four awkward points: tight spacing, a superscript letter or abbreviated form, and a character straddling two columns. The blue oblique lines show that detecting the slant will clearly improve the segmentation.

Without measuring the slant, we cut the line like this.


Figure 6 – Line cut without slant measure

The next figure shows the curves when the slant has been measured. The angle is about 57°.


Figure 7 – Slant and segmentation

The contrast between dark and light lines is more pronounced, and the separation between words is easier to read. However, the distances between characters still need to be refined.
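The article does not detail how the angle is measured, so the following is only a sketch of one common heuristic, named plainly as a substitute: shear the ink mask for a range of candidate angles and keep the angle for which the column profile is the most concentrated. The function names, the angle range and the convention that 90° means upright writing are all assumptions.

```python
import numpy as np


def deslant(ink, angle_deg):
    """Shear a binary ink mask so that strokes at `angle_deg` become vertical.

    The angle is measured from the baseline (assumption: 90 = upright,
    smaller values = writing leaning to the right).
    """
    h, w = ink.shape
    cot = 1.0 / np.tan(np.radians(angle_deg))
    # Horizontal offset of each row relative to the bottom row.
    shifts = np.round((h - 1 - np.arange(h)) * cot).astype(int)
    pad = int(np.abs(shifts).max())
    out = np.zeros((h, w + 2 * pad), dtype=ink.dtype)
    for y, dx in enumerate(shifts):
        out[y, pad - dx:pad - dx + w] = ink[y]    # move the row back by dx
    return out


def estimate_slant(ink, angles=range(30, 91)):
    """Return the candidate angle giving the most concentrated column profile."""
    def score(angle):
        profile = deslant(ink, angle).sum(axis=0).astype(float)
        return float((profile ** 2).sum())        # high when ink piles up in few columns
    return max(angles, key=score)
```

Once the angle is known, the gap search can be run on the output of deslant, where the oblique white spaces have become vertical ones.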

A coarse division of the words, taking the slant into account, gives us this result.


Figure 8 – Coarse cut of words and characters.

Although the result is encouraging, some points need attention. The most suitable cut between two letters does not always follow the slant of the writing: for example, between the “o” and the “l” of the surname, and between the “a” and the “d” of the first name. It is therefore necessary to recalculate the optimum as the “window” progresses from left to right.
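A possible way to recalculate this optimum locally, given here purely as an assumption, is to test several oblique lines starting from the same point on the bottom row and to keep the one that crosses the least ink. The names ink_on_cut and best_cut_angle, and the range of angles, are hypothetical.

```python
import numpy as np


def ink_on_cut(ink, x_base, angle_deg):
    """Count ink pixels crossed by an oblique cut starting at column `x_base`.

    The cut starts on the bottom row and leans to the right as it goes up;
    the angle is measured from the baseline (assumption: 90 = vertical cut).
    """
    h, w = ink.shape
    rows = np.arange(h)
    offsets = np.round((h - 1 - rows) / np.tan(np.radians(angle_deg))).astype(int)
    cols = np.clip(x_base + offsets, 0, w - 1)    # keep the cut inside the image
    return int(ink[rows, cols].sum())


def best_cut_angle(ink, x_base, angles=range(40, 91, 2)):
    """Return the first candidate angle whose cut crosses the least ink."""
    return min(angles, key=lambda angle: ink_on_cut(ink, x_base, angle))
```

Calling best_cut_angle for each candidate position as the window moves along the line gives one angle per cut instead of a single angle for the whole line. In practice the candidate range would probably be centred on the slant measured for the line.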

We also need to proceed in stages: first extract the words, then the characters. Our goal here is the finest possible cut, down to individual characters, in order to enlarge the datasets used to train artificial intelligence models.

The article by Farès Menasri et al. [1] gives us directions to follow.

Word segmentation: Our approach to multi-word recognition is based on an explicit word segmentation. This approach assumes that the words in a line can be segmented based on the spaces between them. This condition is usually met for clean documents written in European languages. The word segmentation module takes an image of a line as input, and outputs a weighted segmentation graph which contains the different hypotheses of the segmentation of the line into words.

The use of Connectionist Temporal Classification and Weighted Finite State Transducers in the handwriting recognition system described in the paper is interesting, and these methods deserve closer study.

Our approach is similar: we assume that the words in a line can be segmented based on the spaces between them. We will also explore another cue, namely the distance between characters together with the width and height of each character.


Figure 9 – Measures in typography

The lowercase letters of the English alphabet fall into several groups:

  • [aceinorsuvx] with a width of “1u”,
  • [mw] with a width of “~2u”,
  • [l] ascender with a width of “~1/2u”,
  • [bdkht] ascender with a width of “1u”,
  • [gjpqyz] descender with a width of “1u”,
  • [f] ascender-descender with a width of “1u”.
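The groups above translate directly into a small lookup table. The sketch below is an illustration under assumptions: the names LETTER_GROUPS, shape_of and expected_width are hypothetical, the label "x-height" for the first group is ours, the approximate widths “~2u” and “~1/2u” are taken as exactly 2 and 0.5, and unknown characters are counted as one unit.

```python
# (letters, vertical shape, width in units "u"), following the list above.
LETTER_GROUPS = [
    ("aceinorsuvx", "x-height", 1.0),
    ("mw", "x-height", 2.0),            # "~2u" taken as 2
    ("l", "ascender", 0.5),             # "~1/2u" taken as 0.5
    ("bdkht", "ascender", 1.0),
    ("gjpqyz", "descender", 1.0),
    ("f", "ascender-descender", 1.0),
]

SHAPE = {ch: shape for letters, shape, _ in LETTER_GROUPS for ch in letters}
WIDTH = {ch: width for letters, _, width in LETTER_GROUPS for ch in letters}


def shape_of(letter):
    """Return the vertical shape class of a lowercase letter."""
    return SHAPE[letter]


def expected_width(word, unit_px):
    """Estimate the width of a word in pixels from the unit width `unit_px`.

    Characters outside the table count as one unit (an assumption).
    """
    return sum(WIDTH.get(ch, 1.0) for ch in word.lower()) * unit_px
```

For instance, with a unit of 10 pixels the word “mill” is expected to span 2 + 1 + 0.5 + 0.5 = 4 units, that is 40 pixels. Comparing such an estimate with the measured width of a slice is one way to guess how many characters it contains.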

For now we have a set of scripts that we must gather and refactor. A few days of coding remain.

References

[1] Farès Menasri, Jérôme Louradour, Anne-Laure Bianne-Bernard and Christopher Kermorvant (2012). The A2iA French handwriting recognition system at the Rimes-ICDAR2011 competition.

Resources

[1] Archives départementales du Nord – France – http://www.archivesdepartementales.lenord.fr

[2] Archives départementales de l’Yonne – France – http://archivesenligne.yonne-archives.fr 
