Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Can Tesseract OCR recognize subscripts and superscripts?

Tags:

ocr

tesseract

I have problems with the general recognition of subscript and superscript in text fragments.

Example-image:

example-image with subscript and superscript

I used Tesseract 4.1.1 with the training data available under https://github.com/tesseract-ocr/tessdata_best. The numerous options had default values except:

  • tessedit_create_hocr = 1 (to get result as HOCR)
  • hocr_font_info = 1 (to get additional font infos like font size)
  • hocr_char_boxes = 1 (to get character-based result)

The language was set to eng. Neither with page segmentation mode 3 (PSM_AUTO_OSD) nor 11 (PSM_SPARSE_TEXT) nor 12 (PSM_SPARSE_TEXT_OSD) the subscript/superscript was recognized correctly.

In the output the sub/sup-fragments were all more or less wrong:

  • "SubtextSub" is recognized as "Subtextsu,"
  • "SuptextSub" is recognized as "Suptexts?"
  • "P0" is recognized as "Po"
  • "P100" is recognized as "P1go"
  • "a2+b2" is recognized as "a+b?"

Using Tesseract for OCR is there a way to ...?

  1. optimize subscript/superscript handling
  2. get infos about recognized subscript/superscript (in the hocr-output - ideally for each character)
like image 768
MaS Avatar asked Nov 07 '25 16:11

MaS


1 Answers

Working on the quality of the image as suggested in other questions/answers to this topic didn't really change anything.

Following these 2 links from the tesseract-google-newsgroup at first it really seemed to be a question of training: link1 and link2.

But after doing some experiments I found out, that the used OEM_DEFAULT-OCR engine mode just doesn't bring up the needed information. I found a partial solution to the problem. Partial, because I now get most infos about sub/sup and also the recognized characters are right in most cases, but not for all characters.

Using the OEM_TESSERACT_ONLY-OCR engine mode (=the legacy mode) and some API methods provided by Tess4J I came up with the following java test class:

public class SubSupEvaluator {
    public void determineSubSupCharacters(BufferedImage image) {
        //1. initialize Tesseract and set image infos
        TessBaseAPI handle = TessAPI1.TessBaseAPICreate();
        try {
            int bpp = image.getColorModel().getPixelSize();
            int bytespp = bpp / 8;
            int bytespl = (int) Math.ceil(image.getWidth() * bpp / 8.0);
            TessBaseAPIInit2(handle, new File("./tessdata/").getAbsolutePath(), "eng", TessOcrEngineMode.OEM_TESSERACT_ONLY);
            TessBaseAPISetPageSegMode(handle, TessPageSegMode.PSM_AUTO_OSD);
            TessBaseAPISetImage(handle, ImageIOHelper.convertImageData(image), image.getWidth(), image.getHeight(), bytespp, bytespl);

            //2. start actual OCR run
            TessBaseAPIRecognize(handle, null);

            //3. iterate over the result character-wise
            TessResultIterator ri = TessBaseAPIGetIterator(handle);
            TessPageIterator pi = TessResultIteratorGetPageIterator(ri);
            TessPageIteratorBegin(pi);
            do {
                //determine character
                Pointer ptr = TessResultIteratorGetUTF8Text(ri, TessPageIteratorLevel.RIL_SYMBOL);
                String character = ptr.getString(0);
                TessDeleteText(ptr); //release memory

                //determine position information
                IntBuffer leftB = IntBuffer.allocate(1);
                IntBuffer topB = IntBuffer.allocate(1);
                IntBuffer rightB = IntBuffer.allocate(1);
                IntBuffer bottomB = IntBuffer.allocate(1);
                TessPageIteratorBoundingBox(pi, TessPageIteratorLevel.RIL_SYMBOL, leftB, topB, rightB, bottomB);

                //write info to console
                System.out.println(String.format("%s - position [%d %d %d %d], subscript: %b, superscript: %b", character, leftB.get(), topB.get(),
                    rightB.get(), bottomB.get(), TessAPI1.TessResultIteratorSymbolIsSubscript(ri) == TessAPI1.TRUE,
                    TessAPI1.TessResultIteratorSymbolIsSuperscript(ri) == TessAPI1.TRUE));
            } while (TessPageIteratorNext(pi, TessPageIteratorLevel.RIL_SYMBOL) == TessAPI1.TRUE);
        } finally {
            TessBaseAPIDelete(handle); //release memory
        }
    }
}

The legacy mode only works with 'normal' training data. Using the '-best' training data is bringing an error.

like image 147
MaS Avatar answered Nov 12 '25 17:11

MaS



Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!