
Use pytesseract OCR to recognize text from an image

I need to use Pytesseract to extract text from this picture:

enter image description here

and the code:

    from PIL import Image, ImageEnhance, ImageFilter
    import pytesseract

    path = 'pic.gif'
    img = Image.open(path)
    img = img.convert('RGBA')
    pix = img.load()
    for y in range(img.size[1]):
        for x in range(img.size[0]):
            if pix[x, y][0] < 102 or pix[x, y][1] < 102 or pix[x, y][2] < 102:
                pix[x, y] = (0, 0, 0, 255)
            else:
                pix[x, y] = (255, 255, 255, 255)
    img.save('temp.jpg')
    text = pytesseract.image_to_string(Image.open('temp.jpg'))
    # os.remove('temp.jpg')
    print(text)

and the "temp.jpg" is

enter image description here

Not bad, but the printed result is ",2 WW" instead of the correct text, "2HHH". How can I remove those black dots?

Smith John asked Jun 10 '16 10:06


2 Answers

Here is my solution:

    import pytesseract
    from PIL import Image, ImageEnhance, ImageFilter

    im = Image.open("temp.jpg")  # the second one

    im = im.filter(ImageFilter.MedianFilter())
    enhancer = ImageEnhance.Contrast(im)
    im = enhancer.enhance(2)
    im = im.convert('1')
    im.save('temp2.jpg')
    text = pytesseract.image_to_string(Image.open('temp2.jpg'))
    print(text)
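
To see why a median filter helps with isolated black dots, here is a minimal pure-Python sketch of a 3x3 median on a tiny grayscale grid. It only illustrates the idea and is not Pillow's implementation; the `median3x3` helper and the sample grid are made up for this example, and border pixels are simply left unchanged.

```python
from statistics import median


def median3x3(grid):
    """Return a copy of grid where each interior pixel is the median of its 3x3 neighbourhood."""
    h, w = len(grid), len(grid[0])
    out = [row[:] for row in grid]  # border pixels keep their original value
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            window = [grid[y + dy][x + dx] for dy in (-1, 0, 1) for dx in (-1, 0, 1)]
            out[y][x] = median(window)
    return out


# White background (255) with a single black speck (0) in the middle.
noisy = [
    [255, 255, 255, 255, 255],
    [255, 255, 255, 255, 255],
    [255, 255,   0, 255, 255],
    [255, 255, 255, 255, 255],
    [255, 255, 255, 255, 255],
]
cleaned = median3x3(noisy)
print(cleaned[2][2])
```

A single dark pixel is outvoted by its eight white neighbours, so it disappears, while strokes that are several pixels wide mostly survive.
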
Smith John answered Sep 21 '22 21:09


Here's a simple approach using OpenCV and Pytesseract OCR. To perform OCR on an image, it's important to preprocess it first. The idea is to obtain a processed image where the text to extract is black and the background is white. To do this, we convert to grayscale, apply a slight Gaussian blur, then apply Otsu's threshold to obtain a binary image. The threshold is inverted, so at this stage the text and the noise are white on a black background. From here, we apply a morphological opening to remove the noise. Finally, we invert the image so the text is black on white. We perform text extraction using the --psm 6 configuration option, which tells Tesseract to assume a single uniform block of text. Other page segmentation modes are available if this one does not fit your image.


Here's a visualization of the image processing pipeline:

Input image

enter image description here

Convert to grayscale -> Gaussian blur -> Otsu's threshold

enter image description here
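
As a rough illustration of what Otsu's method does in the step above, the sketch below picks the threshold that maximises the between-class variance of a list of grayscale values. It is a plain-Python approximation of the idea, not OpenCV's implementation; `otsu_threshold` and the sample pixel values are hypothetical.

```python
def otsu_threshold(pixels):
    """Return the threshold t (0-255) maximising between-class variance.

    Pixels with value <= t form one class, the rest form the other.
    """
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1
    total = len(pixels)
    sum_all = sum(i * hist[i] for i in range(256))

    best_t, best_var = 0, -1.0
    w_low = 0    # number of pixels at or below the candidate threshold
    sum_low = 0  # sum of their intensities
    for t in range(256):
        w_low += hist[t]
        sum_low += t * hist[t]
        if w_low == 0:
            continue  # no pixels in the lower class yet
        w_high = total - w_low
        if w_high == 0:
            break  # every pixel is already in the lower class
        mean_low = sum_low / w_low
        mean_high = (sum_all - sum_low) / w_high
        var_between = w_low * w_high * (mean_low - mean_high) ** 2
        if var_between > best_var:
            best_var, best_t = var_between, t
    return best_t


# Made-up sample: a few dark "text" pixels and many bright "background" pixels.
pixels = [30, 40, 50, 35, 45] + [200, 210, 220, 230, 215] * 3
t = otsu_threshold(pixels)
binary = [0 if p <= t else 255 for p in pixels]
print(t, binary[:5], binary[5:10])
```

Because the threshold is derived from the histogram, no hand-picked cutoff such as 102 is needed.
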

Notice the tiny specks of noise. To remove them, we can perform a morphological opening.

enter image description here
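
A morphological opening is an erosion followed by a dilation. The sketch below shows the effect on a small binary grid with a 3x3 square kernel, where 1 is foreground (white). It is a simplified pure-Python illustration: the `erode`, `dilate`, and `opening` helpers are hypothetical, and pixels outside the image are treated as background, which is an assumption of this sketch rather than OpenCV's default border handling.

```python
OFFSETS = [(dy, dx) for dy in (-1, 0, 1) for dx in (-1, 0, 1)]


def _neighbours(grid, y, x):
    """Yield the 3x3 neighbourhood values, treating out-of-bounds pixels as 0."""
    h, w = len(grid), len(grid[0])
    for dy, dx in OFFSETS:
        ny, nx = y + dy, x + dx
        yield grid[ny][nx] if 0 <= ny < h and 0 <= nx < w else 0


def erode(grid):
    """Keep a pixel only if its whole 3x3 neighbourhood is foreground."""
    return [[int(all(_neighbours(grid, y, x))) for x in range(len(grid[0]))]
            for y in range(len(grid))]


def dilate(grid):
    """Set a pixel if any pixel in its 3x3 neighbourhood is foreground."""
    return [[int(any(_neighbours(grid, y, x))) for x in range(len(grid[0]))]
            for y in range(len(grid))]


def opening(grid):
    """Erosion followed by dilation: removes shapes smaller than the kernel."""
    return dilate(erode(grid))


# A 3x3 block (stands in for a character stroke) plus one isolated speck.
noisy = [
    [0, 0, 0, 0, 0, 0, 0],
    [0, 1, 1, 1, 0, 0, 0],
    [0, 1, 1, 1, 0, 0, 0],
    [0, 1, 1, 1, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 1, 0],
    [0, 0, 0, 0, 0, 0, 0],
]
opened = opening(noisy)
for row in opened:
    print(row)
```

The erosion wipes out anything that cannot contain the full kernel, and the dilation then grows the surviving shapes back to roughly their original size.
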

Finally we invert the image

enter image description here

Result from Pytesseract OCR

2HHH 

Code

    import cv2
    import pytesseract

    pytesseract.pytesseract.tesseract_cmd = r"C:\Program Files\Tesseract-OCR\tesseract.exe"

    # Grayscale, Gaussian blur, Otsu's threshold
    image = cv2.imread('1.png')
    gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
    blur = cv2.GaussianBlur(gray, (3,3), 0)
    thresh = cv2.threshold(blur, 0, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)[1]

    # Morph open to remove noise and invert image
    kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (3,3))
    opening = cv2.morphologyEx(thresh, cv2.MORPH_OPEN, kernel, iterations=1)
    invert = 255 - opening

    # Perform text extraction
    data = pytesseract.image_to_string(invert, lang='eng', config='--psm 6')
    print(data)

    cv2.imshow('thresh', thresh)
    cv2.imshow('opening', opening)
    cv2.imshow('invert', invert)
    cv2.waitKey()
nathancy answered Sep 22 '22 21:09