I'm writing a python script where I use multiproccesing library to launch multiple tesseract instances in parallel. when I use multiple calls to tesseract but in sequence using loop ,it works .However ,when I try to parallel code everything looks fine but I'm not getting any results (I waited for 10 minutes ).
In my code I try to Ocrize multiple pdf pages after I split them from the original multi page PDF.
Here's my code :
def processPage(i):
nameJPG="converted-"+str(i)+".jpg"
nameHocr="converted-"+str(i)
p=subprocess.check_call(["tesseract",nameJPG,nameHocr,"-l","eng","hocr"])
print "tesseract did the job for the ",str(i+1),"page"
pool1=Pool(4)
pool1.map(processPage, range(len(pdf.pages)))
As what i know of pytesseract it will not allow multiple processes if you have quadcore and you are running 4 processes simultaneously than tesseract will be choked and you will have high cpu usage and other stuffs if you require this for company and you dont want to go with google vision api you have to set multiple servers and do socket programming to request text from different servers so that number of parallel process are less than ability of your server to run different processes at same time like for quad core it should be 2 or 3 or other wise you can hit google vision api they have lot of servers and there output is quite good too Disabling multiprocessing in tesseract will also help It can be done by setting OMP_THREAD_LIMIT=1 in the environment. but you must not run multiple process at same servers for tesseract
See https://github.com/tesseract-ocr/tesseract/issues/898#issuecomment-315202167
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With