 

Real-time OCR in Python

The problem

I'm trying to capture my desktop with OpenCV and have Tesseract OCR find text and set it as a variable. For example, if I were playing a game and the capture frame were over a resource amount, I'd want it to print that value so I can use it. A perfect example of this is a video by Michael Reeves where, whenever he loses health in a game, it detects it and sends a signal to his Bluetooth-enabled airsoft gun to shoot him. So far I have this:

# imports
from PIL import ImageGrab
import numpy as np
import pytesseract
import cv2

# region of the screen to capture
x = 760
y = 968
ox = 50
oy = 22

fourcc = cv2.VideoWriter_fourcc(*'XVID')
# the frame size passed to VideoWriter must match the captured region
out = cv2.VideoWriter("output.avi", fourcc, 5.0, (ox, oy))

while(True):
        # screen capture
        img = ImageGrab.grab(bbox=(x, y, x + ox, y + oy))
        img_np = np.array(img)
        # PIL grabs in RGB; OpenCV expects BGR
        frame = cv2.cvtColor(img_np, cv2.COLOR_RGB2BGR)
        cv2.imshow("Screen", frame)
        out.write(frame)

        # waitKey returns -1 when no key is pressed; quit on 'q'
        if cv2.waitKey(1) & 0xFF == ord('q'):
                break

out.release()
cv2.destroyAllWindows()

It captures in real time and displays the frames in a window, but I have no clue how to make it recognise the text in every frame and output it.

Any help?

asked Oct 19 '18 by Peter

2 Answers

It's fairly simple to grab the screen and pass it to Tesseract for OCRing.

The PIL (Pillow) library can grab frames easily on macOS and Windows. However, this feature was only added for Linux recently, so the code below works around it not existing. (I'm on Ubuntu 19.10 and my Pillow does not support it.)

Essentially the user starts the program with screen-region rectangle co-ordinates. The main loop continually grabs this area of the screen, feeding it to Tesseract. If Tesseract finds any non-whitespace text in that image, it is written to stdout.

Note that this is not a proper real-time system. There is no guarantee of timeliness; each frame takes as long as it takes. Your machine might get 60 FPS or it might get 6. This will also be greatly influenced by the size of the rectangle you ask it to monitor.
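
To get a feel for the rate your machine will actually achieve, you can time a single grab-and-OCR cycle before running the full program below. A standalone sketch, assuming your Pillow supports ImageGrab (the coordinates are just an example region):

import time
import pytesseract
from PIL import ImageGrab

start = time.time()
image = ImageGrab.grab( bbox=[ 760, 968, 810, 990 ] )   # example 50x22 region
text  = pytesseract.image_to_string( image )            # OCR the image
print( "one grab+OCR cycle: %.3f seconds" % ( time.time() - start ) )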

#! /usr/bin/env python3

import sys
import pytesseract
from PIL import Image

# Import ImageGrab if possible, might fail on Linux
try:
    from PIL import ImageGrab
    use_grab = True
except Exception as ex:
    # Some older versions of pillow don't support ImageGrab on Linux
    # In which case we will use XLib 
    if ( sys.platform == 'linux' ):
        from Xlib import display, X   
        use_grab = False
    else:
        raise ex


def screenGrab( rect ):
    """ Given a rectangle, return a PIL Image of that part of the screen.
        Handles a Linux installation with an older Pillow by falling back
        to using Xlib """
    global use_grab
    x, y, width, height = rect

    if ( use_grab ):
        image = ImageGrab.grab( bbox=[ x, y, x+width, y+height ] )
    else:
        # ImageGrab can be missing under Linux
        dsp  = display.Display()
        root = dsp.screen().root
        raw_image = root.get_image( x, y, width, height, X.ZPixmap, 0xffffffff )
        image = Image.frombuffer( "RGB", ( width, height ), raw_image.data, "raw", "BGRX", 0, 1 )
        # DEBUG image.save( '/tmp/screen_grab.png', 'PNG' )
    return image


### Do some rudimentary command line argument handling
### So the user can specify the area of the screen to watch
if ( __name__ == "__main__" ):
    EXE = sys.argv[0]
    del( sys.argv[0] )

    # EDIT: catch zero-args
    if ( len( sys.argv ) != 4 or sys.argv[0] in ( '--help', '-h', '-?', '/?' ) ):  # some minor help
        sys.stderr.write( EXE + ": monitors section of screen for text\n" )
        sys.stderr.write( EXE + ": Give x, y, width, height as arguments\n" )
        sys.exit( 1 )

    # TODO - add error checking
    x      = int( sys.argv[0] )
    y      = int( sys.argv[1] )
    width  = int( sys.argv[2] )
    height = int( sys.argv[3] )

    # Area of screen to monitor
    screen_rect = [ x, y, width, height ]  
    print( EXE + ": watching " + str( screen_rect ) )

    ### Loop forever, monitoring the user-specified rectangle of the screen
    while ( True ): 
        image = screenGrab( screen_rect )              # Grab the area of the screen
        text  = pytesseract.image_to_string( image )   # OCR the image

        # IF the OCR found anything, write it to stdout.
        text = text.strip()
        if ( len( text ) > 0 ):
            print( text )

This answer was cobbled together from various other answers on SO.

If you use this for anything regularly, it would be worth adding a rate-limiter to save some CPU. Sleeping for half a second every loop would probably be enough; a sketch of the modified loop follows.
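
For example, the main loop with the sleep added (time is from the standard library; half a second is a guess at a reasonable rate, not a measured value):

import time   # add to the imports at the top of the script

### Loop forever, monitoring the user-specified rectangle of the screen
while ( True ):
    image = screenGrab( screen_rect )              # Grab the area of the screen
    text  = pytesseract.image_to_string( image )   # OCR the image

    # If the OCR found anything, write it to stdout
    text = text.strip()
    if ( len( text ) > 0 ):
        print( text )

    time.sleep( 0.5 )   # Rate-limit: roughly two grabs per second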

answered Sep 25 '22 by Kingsley


Tesseract is a single-use command-line application that uses files for input and output, meaning every OCR call creates a new process and initializes a new Tesseract engine, which includes reading multi-megabyte data files from disk. Its suitability as a real-time OCR engine will depend on the exact use case (more pixels require more time) and which parameters are used to tune the engine. Some experimentation may ultimately be required to tune the engine to the exact scenario, but also expect that the time required to OCR a frame may exceed the frame time, so the frequency of OCR execution may need to be reduced, i.e. performing OCR at 10-20 FPS rather than the 60+ FPS the game may be running at.
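
If OCR cannot keep up with the capture rate, one approach (a sketch, not taken from either answer; the region coordinates are just the question's example) is to grab every frame but only OCR every Nth one, reusing the last result in between:

import pytesseract
from PIL import ImageGrab

OCR_EVERY_N = 6      # e.g. OCR roughly 10 times per second in a ~60 Hz loop
frame_count = 0
last_text = ""

while True:
    frame = ImageGrab.grab(bbox=(760, 968, 810, 990))   # region from the question
    frame_count += 1
    if frame_count % OCR_EVERY_N == 0:                  # only OCR every Nth grab
        last_text = pytesseract.image_to_string(frame).strip()
    if last_text:
        print(last_text)                                # reuse the last OCR result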

In my experience, a reasonably complex document in a 2200x1700px image can take anywhere from 0.5 s to 2 s using the English fast model with 4 cores (the default) on an aging CPU; however, this "complex document" represents the worst-case scenario and makes no assumptions about the structure of the text being recognized. For many scenarios, such as extracting data from a game screen, assumptions can be made to implement a few optimizations and speed up OCR:

  • Reduce the size of the input image. When extracting specific information from the screen, crop the grabbed screen image as much as possible to only that information. If you're trying to extract a value like health, crop the image around just the health value.
  • Use the "fast" trained models to improve speed at the cost of accuracy. You can use the -l option to specify a different model and the --tessdata-dir option to specify the directory containing your model files. You can download multiple models and rename the files to "eng_fast.traineddata", "eng_best.traineddata", etc.
  • Use the --psm parameter to prevent page segmentation not required for your scenario. --psm 7 may be the best option for singular pieces of information, but play around with different values and find which works best.
  • Restrict the allowed character set if you know which characters will be used, such as if you're only looking for numerics, by changing the whitelist configuration value: -c tessedit_char_whitelist='1234567890' (a combined sketch of these options follows this list).
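
Putting these together, a sketch of a single pytesseract call combining a fast model, --psm 7, and a numeric whitelist (the "eng_fast" model name, the tessdata path, and the image file name are assumptions; point them at your own files):

import pytesseract
from PIL import Image

# Assumed location of the .traineddata files, including eng_fast.traineddata
tessdata_dir = "/usr/share/tesseract-ocr/4.00/tessdata"

config = (
    '--tessdata-dir "{}" '                    # where the model files live
    "--psm 7 "                                # treat the image as a single text line
    "-c tessedit_char_whitelist=0123456789"   # digits only
).format(tessdata_dir)

image = Image.open("health_crop.png")         # hypothetical pre-cropped screenshot
text = pytesseract.image_to_string(image, lang="eng_fast", config=config)
print(text.strip())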

pytesseract is the best way to get started with implementing Tesseract: the library can handle image input directly (although it saves the image to a temporary file before passing it to Tesseract) and passes the resulting text back via image_to_string(...).

import pytesseract

# Capture frame...

# If the frame requires cropping:
frame = frame[y:y + h, x:x + w]

# Perform OCR
text = pytesseract.image_to_string(frame, lang="eng_fast", config="--psm 7")

# Process the result; OCR output can be empty or noisy, so guard the conversion
try:
    health = int(text.strip())
except ValueError:
    health = None
answered Sep 21 '22 by Enigma