I've written a script in Python using pytesseract to extract the text embedded in an image. When I run the script, the OCR does its job poorly: the text I get as a result is quite different from what is actually in the image.
The script I've tried:
import requests, io, pytesseract
from PIL import Image
response = requests.get('http://skoleadresser.no/4DCGI/WC_Pedlex_Adresse/864928.jpg')
img = Image.open(io.BytesIO(response.content))
imagetext = pytesseract.image_to_string(img)
print(imagetext)
The text in the image looks like this:
The result I'm getting:
Adresse WM 0an Hanssensm 7 A
4u21 Slavanqer
warm 52 m so no
Te‘efaks 52 m 90 m
E'Dus‘x Van’s strandflanlmu
How can I get the accurate result?
tl;dr:
import requests
import io
import pytesseract
from PIL import Image

response = requests.get('http://skoleadresser.no/4DCGI/WC_Pedlex_Adresse/864928.jpg')
img = Image.open(io.BytesIO(response.content))

# Upscale by a factor of 6 with Lanczos resampling
width, height = img.size
new_size = width*6, height*6
img = img.resize(new_size, Image.LANCZOS)

# Convert to 8-bit grayscale, then binarize with a threshold of 155
img = img.convert('L')
img = img.point(lambda x: 0 if x < 155 else 255, '1')

imagetext = pytesseract.image_to_string(img)
print(imagetext)
Results in:
Adresse Prof. Olav Hanssens vei 7 A
4021 Stavanger
Telefon 52 70 90 00
Telefaks 52 70 90 01
E-post [email protected]
OCR is designed to read letters from a printed, handwritten or typed document scanned at high resolution with essentially no blur. There may be tools dedicated to reading low-resolution, blurry digital images, but in general an OCR engine cannot guess letters from such input at any reasonable rate: the image is simply too blurry and has too few pixels for the tool to do anything useful with it.
This may sound as if there is little chance of getting it to work. Simply scaling the image up without any further processing doesn't do the trick either, as you'll see below: the result would still be too far from resembling typed or printed text.
I did some trial and error with the scaling factor and found 6 to work best for this image, so:
width, height = img.size
new_size = width*6, height*6
Scaling it up by a factor of 6 without any resampling:
img = img.resize(new_size)
gives us this image, which is pretty useless: it is basically the same unreadable image as before, except that each 1px*1px block is now 6px*6px (notice the grey areas between the letters that almost intersect; especially Pr, s and k will cause big problems):
Fortunately there are resampling filters that give very good results. In PIL there is PIL.Image.LANCZOS (amongst others), which applies the Lanczos resampling formula:
img = img.resize(new_size, Image.LANCZOS)
The difference may not seem huge at first, but now the letters are filled properly instead of consisting of black and grey blocks, and the blur is much more natural, which we can work with in the next step. Looking at Pr, s and k again, we see that they no longer intersect as badly.
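The effect of the resampling filter can be checked on a tiny synthetic image (an assumption made up for illustration; NEAREST is effectively what the plain resize above did). Without a filter, upscaling produces only the original pixel values in larger blocks; with Lanczos, intermediate grey values appear, giving the smooth edges we need:

```python
from PIL import Image

# 2x1 test image: one black pixel next to one white pixel
img = Image.new('L', (2, 1))
img.putdata([0, 255])

big_nearest = img.resize((12, 6), Image.NEAREST)  # hard blocks, no new grey values
big_lanczos = img.resize((12, 6), Image.LANCZOS)  # smooth ramp of grey values

print(sorted(set(big_nearest.getdata())))          # only the original two values
print(len(set(big_lanczos.getdata())) > 2)         # intermediate greys appear
```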
What needs to be done next to make the image look more like an actually printed document is to make it black and white, removing the blur. The first step is to convert the image to mode L (8-bit grayscale pixels):
img = img.convert('L')
Of course there is virtually no visible difference, since the source image is already black text on a white background, but this step is still needed so that a brightness threshold can be applied to turn it into a true b/w image.
This is done by evaluating every single pixel of the image against its 8-bit value. A good value to start with is 128, the 50% grey mark:
img = img.point(lambda x: 0 if x < 128 else 255, '1')
This gives us text that is far too thin; the OCR tool will recognize most 5s as S and some 0s as O:
Setting the brightness threshold to 200 instead, we get the following image:
The OCR tool can handle this text, since it looks just like a bold font. But as stated previously, OCR tools aim at normally printed text, so it would likely fail on text that actually is bold in the image, because that text would then come out far too bold in comparison to normal text.
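The point() call for this is the same as before, only with the threshold raised. A tiny self-contained check (the pixel values are made up for illustration) shows how raising the threshold from 128 to 200 turns the mid-grey pixels black, which is what thickens the strokes:

```python
from PIL import Image

# Synthetic 4-pixel grayscale strip standing in for the scaled-up scan:
# black, two mid-greys, white
img = Image.new('L', (4, 1))
img.putdata([0, 150, 199, 255])

# With the threshold at 200, the mid-greys at 150 and 199 also become black
bw = img.point(lambda x: 0 if x < 200 else 255, '1')
print(list(bw.getdata()))  # -> [0, 0, 0, 255]
```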
Let's set the threshold somewhere between 128 and 200 so that we get natural-looking printed text. With a bit of trial and error I found 155 to work great, making the font weight match the original image:
Since this looks very much like a high-res scan of a poorly printed b/w document, the OCR tool can now do its job properly.
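For reuse, the whole preprocessing chain from the tl;dr above can be wrapped in a small helper. This is a sketch: the default scale factor and threshold are the values found by trial and error for this particular image, and will likely need tuning for others:

```python
from PIL import Image

def preprocess_for_ocr(img, scale=6, threshold=155):
    """Upscale, grayscale and binarize an image before passing it to the OCR tool."""
    width, height = img.size
    img = img.resize((width * scale, height * scale), Image.LANCZOS)  # Lanczos resampling
    img = img.convert('L')                                            # 8-bit grayscale
    # Binarize: pixels darker than the threshold become black, the rest white
    return img.point(lambda x: 0 if x < threshold else 255, '1')

# Quick check on a small uniform grey image
sample = Image.new('L', (10, 5), 100)
result = preprocess_for_ocr(sample)
print(result.size, result.mode)  # -> (60, 30) 1
```

The single pytesseract call in the original script then becomes pytesseract.image_to_string(preprocess_for_ocr(img)).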