What's the best way to OCR as much text as possible from video game screenshots?

I'm trying to use the tesseract OCR tool to extract text from video games (I'm preprocessing screenshots, passing them to the command-line tool with TSV output, and parsing that).
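For readers following along, here is a minimal sketch of the "parse the TSV output" step. It assumes the usual tesseract TSV columns (level, page_num, block_num, par_num, line_num, word_num, left, top, width, height, conf, text), where level 5 rows are individual words. The function name and the min_conf parameter are made up for illustration and are not part of the question.

```python
import csv
import io

def parse_tesseract_tsv(tsv_text, min_conf=0.0):
    """Return word-level rows from tesseract TSV output as dicts."""
    # QUOTE_NONE: recognized text may contain stray quote characters.
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t",
                            quoting=csv.QUOTE_NONE)
    words = []
    for row in reader:
        text = (row.get("text") or "").strip()
        # Level 5 rows are words; other levels describe pages, blocks,
        # paragraphs and lines and carry no recognized text of their own.
        if row.get("level") != "5" or not text:
            continue
        conf = float(row["conf"])
        if conf < min_conf:
            continue
        words.append({
            "text": text,
            "conf": conf,
            "left": int(row["left"]),
            "top": int(row["top"]),
            "width": int(row["width"]),
            "height": int(row["height"]),
        })
    return words
```

Keeping the bounding box alongside each word is what later makes it possible to click on a recognized button.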

I'd like to use it for test automation, not unlike Selenium web testing. That is, I'd like to be able to wait for elements to appear instead of sleeping for a fixed time before clicking on buttons (mostly menus).
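The "wait for an element instead of sleeping" idea could look roughly like the polling loop below. This is only a sketch: read_words is a hypothetical callable (not from the question) that takes a screenshot, runs OCR, and returns the recognized words, and the timeout and interval defaults are arbitrary.

```python
import time

def wait_for_text(read_words, target, timeout=10.0, interval=0.5):
    """Poll read_words() until target appears or the timeout elapses.

    read_words is a hypothetical callable returning a list of
    recognized words for the current screen.
    """
    deadline = time.monotonic() + timeout
    wanted = target.casefold()
    while True:
        for word in read_words():
            # Case-insensitive match, since OCR casing can be unreliable.
            if word.casefold() == wanted:
                return word
        if time.monotonic() >= deadline:
            raise TimeoutError(f"{target!r} not seen within {timeout}s")
        time.sleep(interval)
```

A test would then call something like wait_for_text(read_words, "Options") before clicking, much as a Selenium test waits for an element.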

To do that, I need to consistently find the same button text, and find as much text as possible, across a range of video games. For the sake of abstraction I'd prefer the preprocessing/tesseract options to be the same for every game.

I can probably add a dictionary of each word encountered in each game but I'd prefer not to.

I've got a setup where I can test a number of different combinations of pre-processing/tesseract options and see the resulting words.

I've already tried blowing up the screenshot (which is 70-90 dpi) to 5x its size and making it greyscale before passing it to tesseract.

What other techniques can I use to improve the number and accuracy of my results? Which tesseract knobs should I be looking at? Is there any other useful pre-processing I can add?

P.S. I'm finding that if I enlarge the picture to be twice as long/wide, tesseract blows up, seemingly because it runs out of memory for the image. Is there a static limit? Can I find it so I can blow up the image to near the max size? Can I adjust it?

asked May 04 '18 07:05 by Roman A. Taycher



1 Answer

Train your own tessdata

This is by far the most important lesson I've learned from my experience with tesseract. Out of the box, tesseract works really well at recognizing scanned book and newspaper text, but in my experience accuracy decreases significantly when you use it with a font that is not similar to standard book and newspaper fonts (like Times New Roman). Training used to be much more difficult, but nowadays tesstrain.sh makes it a cinch. You will have to gather up your video game fonts (or at least ones that look similar to them) and provide them as input to the training script. Even if your fonts are widely different, tesseract will be able to choose the right font for the provided image at runtime with amazing accuracy.

Also, I know it's tedious, but it would be beneficial to give the training script a wordlist of all the words encountered in the video games. Training tesseract with your own fonts and your own wordlist will give you near-perfect accuracy without doing much of anything else.

Preprocess the image to recognize

Don't rely on tesseract's layout analysis

If you can, do your own layout analysis and crop the image to the parts containing the text. Tesseract has a built-in page segmentation engine, but it has to cover such a broad range of use cases that it most likely will not work for your particular needs. Also, in my experience accuracy improves further if you separate the image into single lines of text and use page segmentation mode 7 (treat the image as a single text line).
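As a rough sketch of running tesseract on an already-cropped line with segmentation mode 7, the command could be built like this. The helper names are invented for illustration. The "--psm" spelling is an assumption about your build (older releases spell it "-psm"), and "stdout" as the output base plus the trailing "tsv" config name are how I understand the command-line tool to work, so check them against your version.

```python
import subprocess

def single_line_command(image_path, lang="eng"):
    """Build a tesseract command that treats the image as one text line."""
    return [
        "tesseract", image_path, "stdout",
        "-l", lang,
        "--psm", "7",  # 7 = treat the image as a single text line
        "tsv",         # config name that selects TSV output
    ]

def ocr_single_line(image_path, lang="eng"):
    """Run tesseract on a cropped line image and return the raw TSV text."""
    result = subprocess.run(single_line_command(image_path, lang),
                            capture_output=True, text=True, check=True)
    return result.stdout
```

The cropping itself is left out on purpose, since how you find the text regions depends on the game's UI.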

Bump up x-height of input text

It helps if you increase the x-height of the input text to the same height you used to train tesseract (IIRC this was 70 pixels in my case).
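The arithmetic here is just a ratio. A small sketch, assuming you have measured the x-height of the text in the screenshot yourself (the function names are made up, and 70 is only the value recalled above, not a recommendation):

```python
def scale_for_xheight(measured_xheight_px, target_xheight_px=70):
    """Return the resize factor that brings text to the target x-height."""
    if measured_xheight_px <= 0:
        raise ValueError("measured x-height must be positive")
    return target_xheight_px / measured_xheight_px

def scaled_size(width, height, factor):
    """Return the (width, height) in pixels after resizing by factor."""
    return (round(width * factor), round(height * factor))
```

For example, text with a 14 pixel x-height needs a factor of 5, which turns a hypothetical 1280x720 screenshot into 6400x3600. Pixel count grows with the square of the factor, which is one more reason to crop to the text regions before enlarging.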

Bump up DPI of input text

Tesseract really likes 300 DPI. Note that changing the DPI of an image is not the same as changing its size (for example, with ImageMagick you would use the -density option to change an image's DPI).
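To illustrate the difference between DPI and size, here is a sketch that builds an ImageMagick command to set the density and computes the nominal physical size implied by a given DPI. The helper names are hypothetical, and the "convert" executable name is an assumption (newer ImageMagick installs may call it "magick").

```python
def set_density_command(src, dst, dpi=300):
    """Build an ImageMagick command that sets the DPI of the output image."""
    return ["convert", src,
            "-units", "PixelsPerInch",
            "-density", str(dpi),
            dst]

def nominal_size_inches(width_px, height_px, dpi):
    """Physical size implied by the pixel dimensions at the given DPI."""
    return (width_px / dpi, height_px / dpi)
```

A 1920x1080 image is nominally 20 x 11.25 inches at 96 DPI and 6.4 x 3.6 inches at 300 DPI, yet it has exactly the same pixels in both cases.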

Tesseract configuration variables to use

In my experience, tweaking the different "penalty" settings having to do with matching dictionary words had the most impact on improving accuracy. The settings that worked for me:

language_model_penalty_non_dict_word      0.975
language_model_penalty_non_freq_dict_word 0.575
segment_penalty_dict_case_bad             1.3125
segment_penalty_dict_case_ok              1.1
segment_penalty_dict_nonword              10.25

But you should obviously do your own tweaking. Also, I found that the x-height settings were very useful at runtime: textord_min_xheight and min_sane_x_ht_pixels.
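One way to try settings like these from a script is to pass each one with the command-line tool's "-c name=value" option. The sketch below only builds the argument list, and the helper names are made up. The values are the ones listed above and should be treated as a starting point, not as recommended defaults.

```python
PENALTIES = {
    "language_model_penalty_non_dict_word": 0.975,
    "language_model_penalty_non_freq_dict_word": 0.575,
    "segment_penalty_dict_case_bad": 1.3125,
    "segment_penalty_dict_case_ok": 1.1,
    "segment_penalty_dict_nonword": 10.25,
}

def config_args(variables):
    """Turn a {name: value} mapping into repeated '-c name=value' arguments."""
    args = []
    for name, value in variables.items():
        args += ["-c", f"{name}={value}"]
    return args

def build_command(image_path, variables):
    """Build a tesseract command with the given configuration variables."""
    return ["tesseract", image_path, "stdout", *config_args(variables), "tsv"]
```

Keeping the variables in a plain dict makes it easy to sweep different values in a test harness and compare the resulting words.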


I am not aware of any memory size limits on tesseract. Are you perhaps using tesseract through a wrapper that has its own limits?


Note: this answer assumes you're using the latest stable build of tesseract, which would be tesseract 3.05. If you're using tesseract 4.0, the advice about doing your own training and segmentation would still apply, but the other sections of the answer may be OBE (overtaken by events).

answered Oct 12 '22 15:10 by mnistic