While trying to run this command:
tesseract bond111.tif bond111 batch.nochop makebox
I get the following error:
Error in pixReadFromTiffStream: spp not in set {1,3}
Error in pixReadStreamTiff: pix not read
Error in pixReadTiff: pix not read
Assuming that "spp not in set" is the main error here, what does it mean?
At first it had trouble because the bpp was higher than 24, so I reduced it using GIMP, but that did not resolve the issue.
It probably means your TIFF image has an alpha channel, which the underlying Leptonica library used by Tesseract doesn't support. "spp" is samples per pixel, and as the message says, the TIFF reader accepts only 1 or 3. If you're using ImageMagick, be aware that operations such as -draw can cause an alpha channel to be added. If you're using convert in your workflow and want to remove the channel again immediately, flatten the image before writing by adding -background white -flatten +matte before the output filename, e.g.:
convert input.tiff -fill white -draw 'rectangle 10,10 20,20' -background white -flatten +matte output.tiff
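To confirm that the samples-per-pixel count is the culprit before converting anything, a small script can read the SamplesPerPixel tag (tag number 277) straight from the TIFF header. This is only a sketch using the Python standard library. The function name tiff_samples_per_pixel is made up for illustration, and it only looks at the first image directory of a classic (non-BigTIFF) file.

```python
import struct

SAMPLES_PER_PIXEL = 277  # TIFF tag number for SamplesPerPixel


def tiff_samples_per_pixel(data: bytes) -> int:
    """Return SamplesPerPixel from the first IFD of a classic TIFF.

    Hypothetical helper for illustration; returns the TIFF default of 1
    when the tag is absent.
    """
    # The first two bytes give the byte order of the whole file.
    if data[:2] == b"II":
        endian = "<"  # little-endian
    elif data[:2] == b"MM":
        endian = ">"  # big-endian
    else:
        raise ValueError("not a TIFF file")

    # Next: the magic number 42 and the offset of the first IFD.
    magic, ifd_offset = struct.unpack(endian + "HI", data[2:8])
    if magic != 42:
        raise ValueError("not a classic TIFF (BigTIFF is not handled)")

    # An IFD is a 2-byte entry count followed by 12-byte entries:
    # tag (2), field type (2), value count (4), value or offset (4).
    (entry_count,) = struct.unpack_from(endian + "H", data, ifd_offset)
    for i in range(entry_count):
        entry = ifd_offset + 2 + 12 * i
        tag, field_type, _count = struct.unpack_from(endian + "HHI", data, entry)
        if tag == SAMPLES_PER_PIXEL:
            # Type 3 is SHORT (the usual case), type 4 is LONG.
            fmt = "I" if field_type == 4 else "H"
            (value,) = struct.unpack_from(endian + fmt, data, entry + 8)
            return value
    return 1
```

For example, tiff_samples_per_pixel(open("bond111.tif", "rb").read()) returning 4 would usually indicate RGB plus an alpha channel, which matches the "spp not in set {1,3}" message.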
Tesseract (well, Leptonica) accepts PNGs these days and is less picky about them, so it might be easier to migrate your workflow to PNG anyway.
Sources: magick-users mailing list posting; tesseract-ocr mailing list posting
Thanks for your post, ZakW; you pointed me in the right direction. However, I also needed to set '-depth 8'. Even then, the quality was not good enough for OCR, whatever I tried.
What worked for me is this solution:
ghostscript -o document.tiff -sDEVICE=tiffgray -r720x720 -g6120x7920 -sCompression=lzw document.pdf
tesseract document.tiff document -l deu
vim document.txt
This way I got perfect text, including umlauts, in German.
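The -g6120x7920 value in the Ghostscript command above looks like a US Letter page (8.5 x 11 inches) at 720 dpi, since 8.5 * 720 = 6120 and 11 * 720 = 7920. If that reading is right, the pixel geometry has to be recomputed whenever the resolution or page size changes. The sketch below shows that arithmetic. The function name ghostscript_tiff_args and its parameters are invented for illustration, and on many installations the executable is called gs rather than ghostscript.

```python
def ghostscript_tiff_args(pdf, tiff, dpi=720, width_in=8.5, height_in=11.0,
                          executable="ghostscript"):
    """Build the argument list for a grayscale TIFF render of a PDF.

    Hypothetical helper: the page size defaults to US Letter, which is an
    assumption inferred from the -g6120x7920 value in the post above.
    """
    # -g takes the page size in device pixels: inches times dots per inch.
    width_px = round(width_in * dpi)
    height_px = round(height_in * dpi)
    return [
        executable,
        "-o", tiff,
        "-sDEVICE=tiffgray",
        f"-r{dpi}x{dpi}",
        f"-g{width_px}x{height_px}",
        "-sCompression=lzw",
        pdf,
    ]
```

The resulting list can be passed to subprocess.run, and the output TIFF then given to tesseract as in the commands above. A lower resolution such as dpi=300 produces a much smaller file, though whether OCR quality holds up at that setting would need to be checked.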