I am trying to write a function that takes a JPG of a house floorplan and uses OCR to extract the square footage that is written somewhere on the image.
import requests
from PIL import Image
import pytesseract
import pandas as pd
import numpy as np
import cv2
import io

def floorplan_ocr(url):
    """A row-wise function to use pytesseract to scrape the word data from the
    floorplan images. Requires Tesseract to be installed:
    https://github.com/tesseract-ocr/tesseract/wiki"""
    if pd.isna(url):
        return np.nan
    res = ''
    response = requests.get(url, stream=True)
    if response.status_code == 200:
        img = response.raw
        img = np.asarray(bytearray(img.read()), dtype="uint8")
        img = cv2.imdecode(img, cv2.CV_8UC1)
        img = cv2.adaptiveThreshold(img, 255, cv2.ADAPTIVE_THRESH_GAUSSIAN_C,
                                    cv2.THRESH_BINARY, 11, 2)
        #img = cv2.adaptiveThreshold(img, 255, cv2.ADAPTIVE_THRESH_GAUSSIAN_C, cv2.THRESH_BINARY, 31, 2)
        res = pytesseract.image_to_string(img, lang='eng', config='--remove-background')
        del response
        del img
    else:
        return np.nan
    #print(res)
    return res
However, I am not having much success. Only about 1 in 4 images actually outputs text that contains the square footage.
e.g. currently
floorplan_ocr('https://i.imgur.com/9qwozIb.jpg')
outputs 'K\'Fréfiéfimmimmuuéé\n2|; apprnxx 135 max\nGArhaPpmxd1m max\n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\nTOTAL APPaux noon AREA 523 so Fr, us. a 50. M )\nav .Wzms him "a! m m... mi unwary mmnmrmm mma y“ mum“;\n‘ wmduw: reams m wuhrmmm mm“ .m nanspmmmmy 3 mm :51\nmm" m mmm m; wan wmumw- mm my and mm mm as m by any\nwfmw PM” rmwm mm m .pwmwm m. mum mud ms nu mum.\n(.5 n: ma undammmw an we Ewen\nM vagw‘m Mewpkeem'
(and takes a long time to do it)
floorplan_ocr('https://i.imgur.com/sjxMpVp.jpg')
outputs ' '
I think some of the issues I am facing are: whether the images should be colour or grayscale, what DPI/resolution Tesseract needs, and the fact that the text size and layout vary between images.
I am stuck and struggling to improve my results. All I want to extract is 'XXX sq ft' (and all the ways that might be written).
Is there a better way to do this?
Many thanks.
By applying these few lines to resize and adjust the contrast/brightness of your second image, after cropping it to the bottom quarter:
img = cv2.imread("download.jpg")
img = cv2.resize(img, (0, 0), fx=2, fy=2)
img = cv2.convertScaleAbs(img, alpha=1.2, beta=-40)
text = pytesseract.image_to_string(img, config='-l eng --oem 1 --psm 3')
I managed to get this result:
TOTAL APPROX. FLOOR AREA 528 SQ.FT. (49.0 SQ.M.)
Whilst every attempt has been made to ensure the accuracy of the floor plan contained here, measurements: of doors, windows, rooms and any other items are approximate and no responsibility ts taken for any error, omission, or mis-statement. This plan is for @ustrative purposes only and should be used as such by any prospective purchaser. The services, systems and appliances shown have not been tested and no guarantee a8 to the operability or efficiency can be given Made with Metropix ©2019
I did not threshold the image, as your images' structures vary from one another, and since the image is not only text, Otsu thresholding does not find the right value.
To answer everything: Tesseract actually works best with grayscale images (black text on a white background).
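For example, a minimal sketch of what that means in OpenCV (the filename is just a placeholder):

import cv2

img = cv2.imread("floorplan.jpg", cv2.IMREAD_GRAYSCALE)  # load straight to a single channel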
About the DPI/resolution question: there is indeed some debate, but there is also some empirical truth: the DPI value doesn't really matter (since text size can vary at the same DPI). For Tesseract OCR to work best, your characters need to be around 30-33 pixels high; smaller by a few px can make Tesseract almost useless, and bigger characters actually reduce accuracy, though not significantly. (Source: https://groups.google.com/forum/#!msg/tesseract-ocr/Wdh_JJwnw94/24JHDYQbBQAJ)
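As a rough sketch of applying that rule (the measured character height below is a hypothetical value you would estimate from your own image, and the filename is a placeholder):

import cv2

TARGET_CHAR_HEIGHT = 32     # the ~30-33 px sweet spot cited above
measured_char_height = 14   # hypothetical: estimate this from your own image

img = cv2.imread("floorplan.jpg", cv2.IMREAD_GRAYSCALE)
scale = TARGET_CHAR_HEIGHT / measured_char_height
img = cv2.resize(img, (0, 0), fx=scale, fy=scale, interpolation=cv2.INTER_CUBIC)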
Finally, the text format doesn't really change (at least in your examples). So your main problems here are text size and the fact that you parse a whole page. If the text line you want is consistently at the bottom of the image, just extract (slice) your original image so you only feed Tesseract the relevant data, which will also make it much faster; see the sketch below.
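A minimal sketch of that slicing, assuming the target line sits in the bottom quarter of the image:

import cv2

img = cv2.imread("download.jpg", cv2.IMREAD_GRAYSCALE)
h = img.shape[0]
bottom = img[3 * h // 4:]   # keep only the bottom quarter before passing it to Tesseract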
EDIT: If you were also looking for a way to extract the square footage from your OCR'd text:
text = "some place holder text 5471 square feet some more text"
# store here all the possible way it can be written
sqft_list = ["sq ft", "square feet", "sqft"]
extracted_value = ""
for sqft in sqft_list:
if sqft in text:
start = text.index(sqft) - 1
end = start + len(sqft) + 1
while text[start - 1] != " ":
start -= 1
extracted_value = text[start:end]
break
print(extracted_value)
5471 square feet
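Note that the loop above is case-sensitive and misses variants like 'SQ.FT.'. A regular expression can catch several spellings in one pass; this is a sketch of an alternative (the pattern is an assumption and may need extending for other variants):

import re

# a number followed by a square-footage unit, case-insensitive:
# "528 SQ.FT.", "546 sq ft", "5471 square feet", "900sqft", ...
pattern = re.compile(r'(\d[\d,.]*)\s*(sq\.?\s*ft\.?|square\s+feet)', re.IGNORECASE)

m = pattern.search("TOTAL APPROX. FLOOR AREA 528 SQ.FT. (49.0 SQ.M.)")
if m:
    print(m.group(0))   # -> 528 SQ.FT.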
All of the pixelation around the text makes it harder for Tesseract to do its thing. I used a simple brightness/contrast algorithm from here to make the dots go away. I didn't do any thresholding/binarization. But I did have to scale the image to get any character recognition.
import pytesseract
import numpy as np
import cv2

img = cv2.imread('floor_original.jpg', 0)  # read as grayscale
img = cv2.resize(img, (0, 0), fx=2, fy=2)  # scale image 2X
alpha = 1.2
beta = -20
img = cv2.addWeighted(img, alpha, img, 0, beta)  # linear brightness/contrast: img*alpha + beta
cv2.imwrite('output.png', img)

res = pytesseract.image_to_string(img, lang='eng', config='--remove-background')
print(res)
Edit: There may be some platform/version dependence in the above code. It runs on my Linux machine, but not on my Windows machine. To get it to run on Windows, I modified the last two lines to:
res = pytesseract.image_to_string(img, lang='eng', config='remove-background')
print(res.encode())
Output from Tesseract (bolding added by me to emphasize the sq footage):
TT xs?
IN
Approximate Gross Internal Area = 50.7 sq m / 546 sq ft
All dimensions are estimates only and may not be exact meas ent plans are subject lo change The sketches. renderngs graph matenala, lava, apectes
ne developer, the management company, the owners and other affiliates re rng oo all of ma ther sole discrebon and without enor scbioe
jements Araxs are approximate
Image after processing: