Identify text areas on a Talmud page

Question

I have a Talmud page like this one:

enter image description here

I want to find the text areas with OpenCV so that each text ends up in its own region, like this:

enter image description here

In the attached image, each area is marked in a different color and each text has a number. What matters is identifying the area that belongs to each text and telling it apart from the areas of the other texts; the numerical order does not matter.

Doing it by eye is really easy, thanks to the white stripes that run between the texts, but I tried to do it with OpenCV and could not.

In the following code I try to catch all the letters and turn them into black rectangles, then enlarge each rectangle until it meets a neighboring rectangle. That way the whole area of each text becomes black, and a clear white stripe remains between the texts.

I do not know how to proceed, or whether this is a good approach.

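// Assumes AForge.Imaging (BlobCounter), OpenCvSharp (Mat, Cv2) and
// OpenCvSharp.Extensions (BitmapConverter) are referenced.
public List<Rectangle> getRects(Mat grayImg)
{
    BlobCounter blobCounter = new BlobCounter();
    blobCounter.ObjectsOrder = ObjectsOrder.None;
    // AForge works on System.Drawing bitmaps, so convert the Mat first.
    blobCounter.ProcessImage(BitmapConverter.ToBitmap(grayImg));
    Blob[] blobs = blobCounter.GetObjectsInformation();

    // Paint the bounding box of every blob as a filled black rectangle.
    var blackBlobs = grayImg.Clone();
    foreach (var b in blobs)
        blackBlobs.Rectangle(new Rect(b.Rectangle.X, b.Rectangle.Y, b.Rectangle.Width, b.Rectangle.Height), Scalar.Black, -1);

    // Median blob width, used as the number of erosion passes.
    var widths = blobs.Select(x => x.Rectangle.Width).ToList();
    widths.Sort();
    int median = widths[widths.Count / 2];

    // Eroding a dark-on-light image grows the black rectangles toward each other.
    Mat eroded = new Mat();
    Cv2.Erode(blackBlobs, eroded, null, iterations: median);

    using (Window win = new Window())
    {
        win.ShowImage(eroded);
        win.WaitKey();
    }

    return blobs.Select(b => b.Rectangle).ToList();
}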

Thanks in advance, any help would be appreciated.

Additional clarification:

As you can see in the previous image, the text areas are not rectangular, but each area can be described as a stack of rectangles of different sizes, one on top of the other.

Note that when two rectangles belong to the same text, they are never placed side by side, only one above the other.

What I am trying to obtain is the collection of these rectangles, together with the text each rectangle belongs to.

An answer can be in any programming language, preferably C++, Python, or C#.

Shai · Accepted Answer

I believe this task can be done mostly with morphological operations.
It is easier to show the concept in MATLAB, but OpenCV has equivalent operations.

We start with a rough estimate of the size of the gap between the different sections of the page. Looking at your example, the gap is about 1% of the page's height.

img = im2single(rgb2gray(imread('https://i.stack.imgur.com/LoV5x.jpg')));  % read the image into 1ch gray scale image in range [0, 1]
gap = ceil(size(img,1) * 0.01);  % gap estimation

First, we would like to use image dilation to create a mask where all words in the same section are connected to each other:

d1 = imdilate(img < 0.5, ones(gap));

The result:
enter image description here

(If it weren't for the annoying words from the next page that the printer adds at the bottom of each section, we would be done...)

There are some large gaps the dilation did not fill; we can use flood fill to complete them:

f = imfill(d1, 'holes');

Now we have full masks for the text regions:
enter image description here

Using erosion to cut between the different sections:

e = imerode(f, ones(1, 5*gap));  % erosion only horizontally

This gives the correct partition, although the regions are now too thin:
enter image description here

Dilating back

d2 = imdilate(e, ones(1, 5*gap));

gives this binary mask:
enter image description here
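To see why the horizontal erode-then-dilate pair separates the sections, it helps to note that, away from the image border, this pair (a morphological opening with a 1-by-w line) keeps a horizontal run of mask pixels only if the run is at least w pixels wide. Narrow vertical bridges between sections are therefore removed, while wide text blocks return to their original width. The following pure-Python sketch illustrates that effect on a tiny mask; the function name `open_rows` and the list-of-lists mask format are assumptions made for this illustration, not part of the answer's MATLAB code.

```python
def open_rows(mask, width):
    """Horizontal opening: in every row keep only runs of 1s at least `width` long."""
    out = []
    for row in mask:
        new_row = [0] * len(row)
        start = None
        # A trailing 0 acts as a sentinel that closes a run ending at the row's edge.
        for x, value in enumerate(list(row) + [0]):
            if value and start is None:
                start = x                      # a run of 1s begins here
            elif not value and start is not None:
                if x - start >= width:         # run is wide enough to survive
                    new_row[start:x] = [1] * (x - start)
                start = None
        out.append(new_row)
    return out


# Two wide blocks joined by a one-pixel-wide vertical bridge.
mask = [
    [1, 1, 1, 1, 1, 1],
    [0, 0, 1, 0, 0, 0],
    [1, 1, 1, 1, 1, 1],
]
opened = open_rows(mask, 3)
```

In this toy mask the middle row holds only the narrow bridge, so the opening clears it and the two wide rows become separate components.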

You can now simply look at the connected components of this binary mask:
enter image description here

I hope this will count as a "Daf Yomi" for me...

Update:
The next step, going from segments to rectangular polygons, requires some geometric operations. I'll outline the approach here and leave the implementation details to you.
Eventually, we want a bounding polygon for each segment, starting from the segment's rectangular bounding box. You'll have to implement this "polygon" class. A crucial method of this class is "polygon subtraction": poly_result = poly_a - poly_b creates a new polygon poly_result, which is poly_a minus the intersection of poly_a and poly_b.

Here's the algorithm:

1. For each segment, compute its bounding box, the area of the bounding box, and the actual number of pixels in the segment.
2. Initialize the polygon of each segment to its bounding box.
3. Sort the segments in descending order of the ratio between the number of pixels and the bounding box area.
4. For each segment, in that descending order, subtract all previous polygons from this segment's polygon.
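One possible way to realize these steps is sketched below in pure Python. It is only an illustration under stated assumptions: a "polygon" is represented as a list of axis-aligned rectangles `(x0, y0, x1, y1)` with exclusive upper bounds, and the names `subtract_rect` and `polygons_from_segments` are hypothetical. The subtraction emits full-width bands above and below the overlap first, so the pieces tend to stack vertically, which is what the question asks for.

```python
def subtract_rect(a, b):
    """Return the parts of rectangle a that are not covered by rectangle b."""
    ax0, ay0, ax1, ay1 = a
    bx0, by0, bx1, by1 = b
    # Intersection of a and b.
    ix0, iy0 = max(ax0, bx0), max(ay0, by0)
    ix1, iy1 = min(ax1, bx1), min(ay1, by1)
    if ix0 >= ix1 or iy0 >= iy1:
        return [a]                                  # no overlap, a is unchanged
    parts = []
    if ay0 < iy0:
        parts.append((ax0, ay0, ax1, iy0))          # full-width band above the overlap
    if iy1 < ay1:
        parts.append((ax0, iy1, ax1, ay1))          # full-width band below the overlap
    if ax0 < ix0:
        parts.append((ax0, iy0, ix0, iy1))          # piece left of the overlap
    if ix1 < ax1:
        parts.append((ix1, iy0, ax1, iy1))          # piece right of the overlap
    return parts


def polygons_from_segments(segments):
    """segments: list of (bounding_box, pixel_count). Returns one rectangle list per segment."""
    def fill_ratio(seg):
        (x0, y0, x1, y1), pixels = seg
        return pixels / ((x1 - x0) * (y1 - y0))

    # Most "box-like" segments first (descending pixel count / bounding box area).
    order = sorted(range(len(segments)), key=lambda i: fill_ratio(segments[i]), reverse=True)
    polygons = {}
    for i in order:
        pieces = [segments[i][0]]                   # start from the bounding box
        for earlier in polygons.values():           # subtract every previous polygon
            for rect in earlier:
                pieces = [p for piece in pieces for p in subtract_rect(piece, rect)]
        polygons[i] = pieces
    return [polygons[i] for i in range(len(segments))]


# A 10x10 block and an L-shaped segment that wraps around it.
result = polygons_from_segments([((0, 0, 10, 10), 100), ((0, 0, 20, 20), 300)])
```

Here the square segment fills its bounding box completely, so it is processed first and keeps its box; the L-shaped segment then loses the overlapping corner and is left with two rectangles.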

You should get something like this:

enter image description
And for the second image:

Tags:

image-processing

opencv

computer-vision

ocr

image-segmentation
