I would like to use Tesseract's page segmentation without running OCR, since I have my own custom OCR model and running both page segmentation AND OCR takes too long. I tried the --psm 2 mode with the Tesseract command line and with pytesseract, and it didn't work as documented.
I'm working in Linux, and am coding in Python 3.10.
I currently use the tesseract-ocr-api from layoutparser Documentation. The code looks like the following:
import layoutparser as lp
ocr_agent = lp.TesseractAgent()
res = ocr_agent.detect(img_path, return_response=True)
layout_info = res['data']
layout_info is then a pd.DataFrame that contains layout information at the level of blocks, paragraphs, lines and words, as well as the OCR output. The problem is that this is very slow: on my machine it takes 7s per image, and I don't actually need the OCR. Hence, I want page segmentation (sometimes also called layout detection) only.
According to the Tesseract documentation, there is the --psm 2 mode, "Automatic page segmentation, but no OSD, or OCR". When I try this on the command line, it does not produce an output file (even if the output type is specified):
tesseract img.png outfile --psm 2
tesseract img.png outfile --psm 2 tsv
I also tried the Python wrapper pytesseract, but it is quite slow, and it again returns the pd.DataFrame with the layout AND OCR data, despite --psm 2 being specified:
import cv2
import pytesseract
img = cv2.imread(img_path)
layout_info = pytesseract.image_to_data(img, config='tsv --psm 2', output_type='data.frame')
I'm using pytesseract==0.3.10 and tesseract 5.3.3-30-gea0b.
Do you have any ideas on how I can achieve page segmentation without OCR with Tesseract (or at least speed up page segmentation + OCR)?
You can check whether --psm 2 is implemented in your Tesseract build with the command:
tesseract --help-psm 2
Output on my machine:
Page segmentation modes:
0 Orientation and script detection (OSD) only.
1 Automatic page segmentation with OSD.
2 Automatic page segmentation, but no OSD, or OCR. (not implemented)
3 Fully automatic page segmentation, but no OSD. (Default)
4 Assume a single column of text of variable sizes.
5 Assume a single uniform block of vertically aligned text.
6 Assume a single uniform block of text.
7 Treat the image as a single text line.
8 Treat the image as a single word.
9 Treat the image as a single word in a circle.
10 Treat the image as a single character.
11 Sparse text. Find as much text as possible in no particular order.
12 Sparse text with OSD.
13 Raw line. Treat the image as a single text line,
bypassing hacks that are Tesseract-specific.
The relevant line is:
--psm 2 Automatic page segmentation, but no OSD, or OCR. (not implemented)
So the mode is listed but not implemented, which means you can't use it.
Processing time depends on the image quality and the amount of text. Do you have an example image where OCR takes too long?
Have a look at tesserocr - a Python wrapper of the Tesseract API. With it you can also access functionality that is not available via the tesseract executable (pytesseract just wraps the tesseract executable and has no direct access to its API).
I did not test it, but with tesserocr you can use AnalyseLayout without running Recognize - see function documentation in the tesseract source code.
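A minimal, untested sketch of that idea follows. It assumes tesserocr is installed together with its tessdata files; the function names segment_page and box_to_row are made up for this example. It calls AnalyseLayout and walks the returned page iterator, without ever calling Recognize. Only block and text-line levels are shown, since paragraph information may not be populated before recognition.

```python
from typing import Dict, List, Tuple

# (left, top, right, bottom), the tuple shape returned by BoundingBox
Box = Tuple[int, int, int, int]


def box_to_row(level: str, box: Box) -> Dict[str, object]:
    """Convert a (left, top, right, bottom) box into a TSV-like row
    with left/top/width/height, similar to image_to_data columns."""
    left, top, right, bottom = box
    return {
        "level": level,
        "left": left,
        "top": top,
        "width": right - left,
        "height": bottom - top,
    }


def segment_page(image_path: str, level_name: str = "line") -> List[Dict[str, object]]:
    """Hypothetical helper: layout analysis only, no Recognize call."""
    # Imported lazily so box_to_row can be used without tesserocr installed.
    from tesserocr import PSM, RIL, PyTessBaseAPI, iterate_level

    levels = {"block": RIL.BLOCK, "line": RIL.TEXTLINE}
    level = levels[level_name]

    rows: List[Dict[str, object]] = []
    with PyTessBaseAPI(psm=PSM.AUTO) as api:
        api.SetImageFile(image_path)
        page_it = api.AnalyseLayout()  # segmentation only
        if page_it is None:  # nothing found on the page
            return rows
        for element in iterate_level(page_it, level):
            box = element.BoundingBox(level)
            if box is not None:
                rows.append(box_to_row(level_name, box))
    return rows
```

Calling segment_page("img.png") should then give one row per detected text line, and pd.DataFrame(rows) turns the result into a table comparable to the layout columns of image_to_data. Whether this is fast enough for your images would need to be measured.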
Tesseract's processing time also depends on your hardware (e.g. SSD vs. HDD, availability of SSE/AVX or NEON instructions).