I want to automatically divide an image of ancient handwritten text into lines (and into words later).
I'm using a simple binarization based on pixel brightness, and I store the result in a two-dimensional array.
My first algorithm was pretty simple: if a row of the array contains more black pixels than the root mean square of the maximum and minimum row counts, that row is part of a line.
After forming the list of lines I discard lines whose height is less than the average. In the end it turned into a kind of linear regression, trying to minimize the difference between blank rows and text rows (that was my assumption).
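For reference, a minimal sketch of that first threshold rule, assuming the binarized image is already in a bool[,] called bit with known width and height (those names are mine, not from the actual code):

// Sketch of the simple row-density threshold.
// bit[x, y] == true is assumed to mean a black pixel.
int[] rowCounts = new int[height];
for (int y = 0; y < height; y++)
    for (int x = 0; x < width; x++)
        if (bit[x, y]) rowCounts[y]++;

int min = int.MaxValue, max = int.MinValue;
foreach (int c in rowCounts)
{
    if (c < min) min = c;
    if (c > max) max = c;
}

// root mean square of the minimum and maximum row counts
double threshold = Math.Sqrt((min * (double)min + max * (double)max) / 2.0);

// rows with more black pixels than the threshold are treated as part of a text line
bool[] isTextRow = new bool[height];
for (int y = 0; y < height; y++)
    isTextRow[y] = rowCounts[y] > threshold;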
My second attempt: I tried a GA with several fitness functions. The chromosome contains three values: x0 ∈ [-1, 0], x1 ∈ [0, 0.5], x2 ∈ [0, 0.5].
The function that decides whether a row belongs to a line is (x0 + α1·x1 + α2·x2) > 0, where α1 is the scaled sum of black pixels in the row and α2 is the median of the gaps between neighboring black pixels in the row (α1, α2 ∈ [0, 1]). Other functions I tried are (x1 < α1 OR x2 > α2) and (1/x0 + (α1·x1) / (α2·x2)) > 0. The last function is the most effective. The fitness function is 1 / (HeightRange + SpacesRange),
where a range is the difference between the maximum and the minimum. It represents the homogeneity of the text; the global optimum of this function is the smoothest way to divide the image into lines.
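A rough sketch of how those pieces could fit together; the method names, the epsilon guards, and the way the per-row values are gathered are my assumptions, not the actual implementation:

// Decision: row belongs to a text line if (1/x0 + (a1*x1)/(a2*x2)) > 0,
// where a1 = scaled black-pixel count of the row, a2 = median gap in the row.
bool IsTextRow(double x0, double x1, double x2, double a1, double a2)
{
    return (1.0 / x0) + (a1 * x1) / (a2 * x2 + 1e-9) > 0; // epsilon guards a2 == 0
}

// Fitness: 1 / (HeightRange + SpacesRange), where each range is max - min
// of the detected line heights / inter-line spaces (assumes at least one of each).
double Fitness(List<int> lineHeights, List<int> spaceHeights)
{
    double heightRange = lineHeights.Max() - lineHeights.Min();
    double spacesRange = spaceHeights.Max() - spaceHeights.Min();
    return 1.0 / (heightRange + spacesRange + 1e-9); // epsilon avoids division by zero
}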
I am using C# with my own GA implementation (classical, with 2-point crossover, Gray-coded chromosomes, a maximum population of 40, and a mutation rate of 0.05).
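For completeness, the standard binary/Gray conversion and the classical 2-point crossover that such a setup typically relies on look roughly like this (a generic sketch, not the poster's code):

// Standard binary -> Gray and Gray -> binary conversion for an unsigned integer gene.
static uint BinaryToGray(uint n) => n ^ (n >> 1);

static uint GrayToBinary(uint gray)
{
    uint value = gray;
    while ((gray >>= 1) != 0)
        value ^= gray;
    return value;
}

// Classical 2-point crossover on bit-string chromosomes.
static (bool[], bool[]) TwoPointCrossover(bool[] a, bool[] b, Random rng)
{
    int p1 = rng.Next(a.Length), p2 = rng.Next(a.Length);
    if (p1 > p2) (p1, p2) = (p2, p1);
    var childA = (bool[])a.Clone();
    var childB = (bool[])b.Clone();
    for (int i = p1; i < p2; i++)   // swap the segment between the two cut points
    {
        childA[i] = b[i];
        childB[i] = a[i];
    }
    return (childA, childB);
}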
Now I've run out of ideas for how to divide this image into lines with ~100% accuracy.
What is an efficient algorithm to do this?
UPDATE: Original BMP (1.3 MB)
UPDATE 2: Improved the results on this text to 100%.
How I did it:
Problem:
The GA surprisingly failed to recognize this line. I looked at the debug data of the 'find ranges' function and found that there is too much noise in the 'unrecognized' area. The function code is below:
public double[] Ranges()
{
    var ranges = new double[_original.Height];

    for (int y = 0; y < _original.Height; y++)
    {
        ranges[y] = 0;
        var dx = new List<int>();
        int last = 0;
        int x = 0;

        // find the first black pixel in the row
        while (last == 0 && x < _original.Width)
        {
            if (_bit[x, y])
                last = x;
            x++;
        }

        // no black pixels at all in this row
        if (last == 0)
        {
            ranges[y] = 0;
            continue;
        }

        // collect the gaps between consecutive black pixels
        for (x = last; x < _original.Width; x++)
        {
            if (!_bit[x, y]) continue;
            if (last != x - 1)
            {
                dx.Add((x - last) + 1);
            }
            last = x;
        }

        // use the median gap as the row's range value
        if (dx.Count > 2)
        {
            dx.Sort();
            ranges[y] = dx[dx.Count / 2];
            //ranges[y] = dx.Average();
        }
        else
            ranges[y] = 0;
    }

    // hack: replace empty rows (range 0) with the maximum range
    var maximum = ranges.Max();
    for (int i = 0; i < ranges.Length; i++)
    {
        if (Math.Abs(ranges[i] - 0) < 0.9)
            ranges[i] = maximum;
    }
    return ranges;
}
I'm using some hacks in this code. The main reason is that I want to minimize the gap between the nearest black pixels, but if a row contains no pixels the value becomes 0, which makes it impossible to solve the problem by finding optima. The second reason is that this code changes too frequently. I'd like to rewrite it completely, but I have no idea how.
Although I'm not sure how to translate the following algorithm into GA terms (and I'm not sure why you need a GA for this problem at all), and I could be off base in proposing it, here goes.
The simple technique I would propose is to count the number of black pixels per row. (Actually it's the dark pixel density per row.) This requires very few operations, and with a few additional calculations it's not difficult to find peaks in the pixel-sum histogram.
A raw histogram will look something like this, where the profile along the left side shows the number of dark pixels in a row. For visibility, the actual count is normalized to stretch out to x = 200.
After some additional, simple processing is added (described below), we can generate a histogram like this that can be clipped at some threshold value. What remains are peaks indicating the center of lines of text.
From there it's a simple matter to find the lines: just clip (threshold) the histogram at some value such as 1/2 or 2/3 the maximum, and optionally check that the width of the peak at your clipping threshold is some minimum value w.
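A minimal sketch of that clipping step, assuming the smoothed per-row histogram is already in a float[] (the method and parameter names are mine):

// Clip the smoothed histogram and extract bands of consecutive rows above the threshold.
// Each surviving band of at least minWidth rows is taken as one line of text.
List<(int top, int bottom)> FindLines(float[] hist, float clipFraction = 0.5f, int minWidth = 5)
{
    float threshold = hist.Max() * clipFraction;
    var lines = new List<(int top, int bottom)>();
    int start = -1;

    for (int y = 0; y < hist.Length; y++)
    {
        bool above = hist[y] >= threshold;
        if (above && start < 0) start = y;                  // a band begins
        if ((!above || y == hist.Length - 1) && start >= 0) // a band ends
        {
            int end = above ? y : y - 1;
            if (end - start + 1 >= minWidth)
                lines.Add((start, end));
            start = -1;
        }
    }
    return lines;
}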
One implementation of the full (yet still simple!) algorithm to find the nicer histogram is as follows:
1. Binarize the image: any pixel with a value below 128 is a dark (text) pixel.
2. Create a second image that will hold only the dark pixels belonging to vertical strokes.
3. For each dark pixel (x, y), count the dark pixels directly above and below it within a radius r, and keep (x, y) in the second image only if at least half of those 2r vertical neighbors are dark (the "vertical count").
4. Count the surviving dark pixels in each row.
5. Normalize each row count by the ink width of the row (leftmost to rightmost dark pixel), treating rows with very few pixels as empty.
6. Smooth the per-row values by averaging each value with its two neighbors.
The "vertical count" (step 3) eliminates horizontal strokes that happen to be located above or below the center line of text. A more sophisticated algorithm would just check directly above and below (x,y), but also to the upper left, upper right, lower left, and lower right.
With my rather crude implementation in C# I was able to process the image in less than 75 milliseconds. In C++, and with some basic optimization, I've little doubt the time could be cut down considerably.
This histogram method assumes the text is horizontal. Since the algorithm is reasonably fast, you may have enough time to calculate pixel count histograms at increments of every 5 degrees from the horizontal. The scan orientation with the greatest peak/valley differences would indicate the rotation.
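One way to run that angle sweep without rotating the bitmap is to project every dark pixel onto an axis tilted by the candidate angle and score how "peaky" the resulting profile is; the following sketch (my illustration, not the answer's code) uses the sum of squared bin counts as that score:

// Score a candidate skew angle (in degrees) by projecting dark pixels onto the
// axis perpendicular to the assumed text direction and measuring profile peakiness.
static double SkewScore(bool[,] dark, double angleDegrees)
{
    int w = dark.GetLength(0), h = dark.GetLength(1);
    double theta = angleDegrees * Math.PI / 180.0;
    double cos = Math.Cos(theta), sin = Math.Sin(theta);

    // the projection can be negative, so offset the bin index by a safe margin
    int offset = w + h;
    var bins = new int[2 * offset + 1];

    for (int x = 0; x < w; x++)
        for (int y = 0; y < h; y++)
            if (dark[x, y])
                bins[(int)Math.Round(y * cos - x * sin) + offset]++;

    // a well-aligned angle produces tall peaks and deep valleys;
    // the sum of squared bin counts rewards that
    double score = 0;
    foreach (int b in bins) score += (double)b * b;
    return score;
}

// Example: try angles from -15 to +15 degrees in 5-degree steps and keep the best score.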
I'm not familiar with GA terminology, but if what I've suggested is of some value I'm sure you can translate it into GA terms. In any case, I was interested in this problem anyway, so I might as well share.
EDIT: if you do use a GA, it may be better to think in terms of "distance since the previous dark pixel in X" (or along angle theta) and "distance since the previous dark pixel in Y" (or along angle theta - pi/2). You might also check the distance from a white pixel to a dark pixel in all radial directions (to find loops).
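A quick sketch of the "distance since the previous dark pixel in X" feature mentioned above (the Y version is the same pass run over columns); this is only an illustration of the idea:

// Per-pixel distance (in pixels) since the previous dark pixel along the X direction.
// dark[x, y] is the binarized image; int.MaxValue means no dark pixel has been seen yet.
static int[,] DistanceSincePreviousDarkX(bool[,] dark)
{
    int w = dark.GetLength(0), h = dark.GetLength(1);
    var dist = new int[w, h];
    for (int y = 0; y < h; y++)
    {
        int since = int.MaxValue;
        for (int x = 0; x < w; x++)
        {
            if (dark[x, y]) since = 0;
            else if (since != int.MaxValue) since++;
            dist[x, y] = since;
        }
    }
    return dist;
}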
byte[,] arr = get2DArrayFromBitamp();   //source array from originalBitmap
int w = arr.GetLength(0);               //width of 2D array
int h = arr.GetLength(1);               //height of 2D array

//we can use a second 2D array of dark pixels that belong to vertical strokes
byte[,] bytes = new byte[w, h];         //dark pixels in vertical strokes

//initial morph
int r = 4;       //radius to check for dark pixels
int count = 0;   //number of dark pixels within radius

//fill the bytes[,] array only with pixels belonging to vertical strokes
for (int x = 0; x < w; x++)
{
    //for the first r rows, just set pixels to white
    for (int y = 0; y < r; y++)
    {
        bytes[x, y] = 255;
    }

    //assume pixels of value < 128 are dark pixels in text
    for (int y = r; y < h - r - 1; y++)
    {
        count = 0;

        //count the dark pixels above and below (x,y)
        //total range of check is 2r, from -r to +r
        for (int j = -r; j <= r; j++)
        {
            if (arr[x, y + j] < 128)
                count++;
        }

        //if half the pixels are dark, [x,y] is part of a vertical stroke
        bytes[x, y] = count >= r ? (byte)0 : (byte)255;
    }

    //for the last r rows, just set pixels to white
    for (int y = h - r - 1; y < h; y++)
    {
        bytes[x, y] = 255;
    }
}

//count the number of valid dark pixels in each row
float max = 0;
float[] bins = new float[h];   //normalized "dark pixel strength" for all h rows
int left, right, width;        //leftmost and rightmost dark pixels in row
bool dark = false;             //tracking variable

for (int y = 0; y < h; y++)
{
    //initialize values at beginning of loop iteration
    left = 0;
    right = 0;
    width = 100;

    for (int x = 0; x < w; x++)
    {
        //use value of 128 as threshold between light and dark
        dark = bytes[x, y] < 128;

        //increment bin if pixel is dark
        bins[y] += dark ? 1 : 0;

        //update leftmost and rightmost dark pixels
        if (dark)
        {
            if (left == 0) left = x;
            if (x > right) right = x;
        }
    }

    width = right - left + 1;

    //for bins with few pixels, treat them as empty
    if (bins[y] < 10)
        bins[y] = 0;

    //normalize value according to width
    //divide bin count by width (leftmost to rightmost)
    bins[y] /= width;

    //calculate the maximum bin value so that bins can be scaled when drawn
    if (bins[y] > max)
        max = bins[y];
}

//calculate the smoothed value of each bin i by averaging bins i-1, i, and i+1
float[] smooth = new float[bins.Length];
smooth[0] = bins[0];
smooth[smooth.Length - 1] = bins[bins.Length - 1];

for (int i = 1; i < bins.Length - 1; i++)
{
    smooth[i] = (bins[i - 1] + bins[i] + bins[i + 1]) / 3;
}

//create a new bitmap based on the original bitmap, then draw bins on top
Bitmap bmp = new Bitmap(originalBitmap);

using (Graphics gr = Graphics.FromImage(bmp))
{
    for (int y = 0; y < bins.Length; y++)
    {
        //scale each bin so that it is drawn up to 200 pixels wide from the left edge
        float value = 200 * (float)smooth[y] / max;
        gr.DrawLine(Pens.Red, new PointF(0, y), new PointF(value, y));
    }
}

pictureBox1.Image = bmp;
After fiddling around with this for a while I found that I simply need to count the number of crossings in each row: a switch from white to black counts as one, and a switch from black to white increments the count again. By highlighting every row with a count > 66 I got close to 100% accuracy, except for the bottommost line.
Of course, this would not be robust to slightly rotated scanned documents, and it has the disadvantage of needing to determine the correct threshold.
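A small sketch of that crossing count, assuming the binarized image is in a bool[,] (the array name is mine; the threshold of 66 is the value from the answer):

// Count white-to-black and black-to-white transitions in each row;
// rows with many transitions are likely to contain text.
static int[] CrossingCounts(bool[,] black)
{
    int w = black.GetLength(0), h = black.GetLength(1);
    var counts = new int[h];

    for (int y = 0; y < h; y++)
    {
        bool previous = black[0, y];
        for (int x = 1; x < w; x++)
        {
            if (black[x, y] != previous) counts[y]++;   // a crossing in either direction
            previous = black[x, y];
        }
    }
    return counts;
}

// Example: mark rows whose crossing count exceeds the hand-tuned threshold.
// bool[] isTextRow = CrossingCounts(black).Select(c => c > 66).ToArray();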
IMHO, with the image shown, this would be very hard to do 100% perfectly. My answer is to offer some alternative ideas.
Idea 1: Make your own version of reCAPTCHA (to put on your very own pron site) and make it a fun game: "cut out a word" (the edges should all be white space, with some tolerance for characters overlapping from the lines above and below).
Idea 2: This was a game we played as kids: a coat-hanger wire was bent into waves and connected to a buzzer, and you had to guide a wand with a ring on the end along the wire, from one side to the other, without setting the buzzer off. Perhaps you could adapt this idea and make a mobile game where people trace out the lines without touching the black text (with tolerance for overlapping characters)... when they complete a line they get points and move on to new levels where you give them harder images.
Idea 3: Research how Google/reCAPTCHA got around this.
Idea 4: Get the Photoshop SDK and master the functionality of its Extract Edges tool.
Idea 5: Stretch the image a lot on the Y axis (which should help), apply the algorithm, then scale the resulting line positions back down and apply them to the normal-sized image.