
How can I use the Keras OCR example?

I found examples/image_ocr.py, which seems to be for OCR. Hence it should be possible to give the model an image and receive text. However, I have no idea how to do so. How do I feed the model a new image? What kind of preprocessing is necessary?

What I did

Installing the dependencies:

  • Install cairocffi: sudo apt-get install python-cairocffi
  • Install editdistance: sudo -H pip install editdistance
  • Change train to return the model and save the trained model.
  • Run the script to train the model.

Now I have a model.h5. What's next?

See https://github.com/MartinThoma/algorithms/tree/master/ML/ocr/keras for my current code. I know how to load the model (see below) and this seems to work. The problem is that I don't know how to feed new scans of images with text to the model.

Related side questions

  • What is CTC? Connectionist Temporal Classification?
  • Are there algorithms which reliably detect the rotation of a document?
  • Are there algorithms which reliably detect lines / text blocks / tables / images (hence make a reasonable segmentation)? I guess edge detection with smoothing and line-wise histograms already works reasonably well for that?

What I tried

    #!/usr/bin/env python

    from keras import backend as K
    import keras
    from keras.models import load_model
    import os

    from image_ocr import ctc_lambda_func, create_model, TextImageGenerator
    from keras.layers import Lambda
    from keras.utils.data_utils import get_file
    import scipy.ndimage
    import numpy

    img_h = 64
    img_w = 512
    pool_size = 2
    words_per_epoch = 16000
    val_split = 0.2
    val_words = int(words_per_epoch * (val_split))
    if K.image_data_format() == 'channels_first':
        input_shape = (1, img_w, img_h)
    else:
        input_shape = (img_w, img_h, 1)

    fdir = os.path.dirname(get_file('wordlists.tgz',
                                    origin='http://www.mythic-ai.com/datasets/wordlists.tgz', untar=True))

    img_gen = TextImageGenerator(monogram_file=os.path.join(fdir, 'wordlist_mono_clean.txt'),
                                 bigram_file=os.path.join(fdir, 'wordlist_bi_clean.txt'),
                                 minibatch_size=32,
                                 img_w=img_w,
                                 img_h=img_h,
                                 downsample_factor=(pool_size ** 2),
                                 val_split=words_per_epoch - val_words
                                 )
    print("Input shape: {}".format(input_shape))
    model, _, _ = create_model(input_shape, img_gen, pool_size, img_w, img_h)

    model.load_weights("my_model.h5")

    x = scipy.ndimage.imread('example.png', mode='L').transpose()
    x = x.reshape(x.shape + (1,))

    # Does not work
    print(model.predict(x))

This gives:

    2017-07-05 22:07:58.695665: I tensorflow/core/common_runtime/gpu/gpu_device.cc:996] Creating TensorFlow device (/gpu:0) -> (device: 0, name: GeForce GTX TITAN Black, pci bus id: 0000:01:00.0)
    Traceback (most recent call last):
      File "eval_example.py", line 45, in <module>
        print(model.predict(x))
      File "/usr/local/lib/python2.7/dist-packages/keras/engine/training.py", line 1567, in predict
        check_batch_axis=False)
      File "/usr/local/lib/python2.7/dist-packages/keras/engine/training.py", line 106, in _standardize_input_data
        'Found: array with shape ' + str(data.shape))
    ValueError: The model expects 4 arrays, but only received one array. Found: array with shape (512, 64, 1)
asked Jun 30 '17 13:06 by Martin Thoma



1 Answer

Well, I will try to answer everything you asked here:

As the comments in the OCR code note, Keras doesn't support loss functions with multiple parameters, so the example calculates the network's loss in a Lambda layer. What does this mean in this case?

The neural network may look confusing because it takes 4 inputs ([input_data, labels, input_length, label_length]) and produces loss_out as its output. Apart from input_data, all of these inputs are used only to calculate the loss, which means they are needed only during training. This is also why model.predict complains that the model expects 4 arrays. For prediction, we want something like line 468 of the original code:

Model(inputs=input_data, outputs=y_pred).summary() 

which means "I have an image as input, please tell me what is written here". So how do we achieve that?

1) Keep the original training code as it is and do the training normally;

2) After training, save the model Model(inputs=input_data, outputs=y_pred) in a .h5 file so that it can be loaded wherever you want;

3) Do the prediction: if you take a look at the code, the input image is transposed (width before height) and rescaled, so you can use this code to make it easy:

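    import numpy as np
    # Note: scipy.misc.imread/imresize were removed in newer SciPy releases,
    # so this snippet needs an older SciPy.
    from scipy.misc import imread, imresize

    # Use the width and height of your neural network here
    # (the script in the question uses img_w = 512 and img_h = 64).
    width, height = 512, 64

    def load_for_nn(img_file):
        image = imread(img_file, flatten=True)   # load as grayscale
        image = imresize(image, (height, width))
        image = image.T                          # the network expects (width, height)

        # Change 1 to the number of images you want to predict; here it is just one.
        images = np.ones((1, width, height))
        images[0] = image
        images = images[:, :, :, np.newaxis]     # add the channel axis
        images /= 255                            # scale pixel values to [0, 1]

        return images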

With the image loaded, let's do the prediction:

    def predict_image(image_path):  # insert the path of your image
        image = load_for_nn(image_path)   # load with the snippet above
        # model is the prediction model saved in step 2
        raw_word = model.predict(image)   # do the prediction with the neural network
        # The output of the neural network is only numbers. Use decode_output
        # from image_ocr.py to get the desired string.
        final_word = decode_output(raw_word)[0]
        return final_word

This should be enough. In my experience, the images used in training are not good enough to make good predictions. If necessary, I will later release code that uses other datasets, which improved my results.

Answering related questions:

  • What is CTC? Connectionist Temporal Classification?

It is a technique used to improve sequence classification. The original paper shows that it improves results on recognizing what is said in audio. In this case the sequence is a sequence of characters. The explanation is a bit tricky, but you can find a good one here.
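To make the idea more concrete, here is a minimal sketch of greedy (best-path) CTC decoding, which is roughly what a decoding helper has to do with the network output: take the most likely class at every time step, merge consecutive repeats, and drop the blank symbol. The helper name greedy_ctc_decode, the 27-symbol alphabet (a-z plus space) and the choice of the last class index as the blank are assumptions for illustration; check them against the alphabet used in your training code.

```python
# Assumption: alphabet of a-z plus space, with the CTC blank as the last class.
ALPHABET = "abcdefghijklmnopqrstuvwxyz "
BLANK = len(ALPHABET)


def greedy_ctc_decode(scores, alphabet=ALPHABET, blank=BLANK):
    """Decode one sample given as a list of per-time-step class scores."""
    # 1. Pick the most likely class at every time step.
    best_path = [max(range(len(step)), key=step.__getitem__) for step in scores]
    # 2. Merge consecutive repeats and 3. drop the blank symbol.
    chars = []
    previous = None
    for label in best_path:
        if label != previous and label != blank:
            chars.append(alphabet[label])
        previous = label
    return "".join(chars)


def one_hot(index, size=len(ALPHABET) + 1):
    """Build a fake score vector that puts all mass on one class."""
    return [1.0 if i == index else 0.0 for i in range(size)]


# Frame-wise labels "h h _ e l l _ l o" (with _ = blank) collapse to "hello".
path = [7, 7, BLANK, 4, 11, 11, BLANK, 11, 14]
demo_scores = [one_hot(i) for i in path]
print(greedy_ctc_decode(demo_scores))  # prints "hello"
```

The blank between the two runs of "l" is what lets CTC represent a doubled letter; without it, the repeats would be merged into a single "l".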

  • Are there algorithms which reliably detect the rotation of a document?

I am not sure, but you could take a look at attention mechanisms in neural networks. I don't have a good link right now, but I believe they could be used for this.
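A classical, non-neural alternative is the projection-profile method. It is not what the answer above suggests, just a different technique sketched here under strong assumptions: the page is already binarized, the ink pixel coordinates are available, and the skew lies within the candidate range. The function names estimate_skew and rotate are illustrative only.

```python
import math


def estimate_skew(points, candidates):
    """Return the candidate angle (degrees) that best aligns ink with rows.

    `points` are (x, y) coordinates of ink pixels.
    """
    best_angle, best_score = None, -1.0
    for angle in candidates:
        theta = math.radians(angle)
        # Row index of every ink pixel after rotating the page back by `angle`.
        histogram = {}
        for x, y in points:
            row = round(-x * math.sin(theta) + y * math.cos(theta))
            histogram[row] = histogram.get(row, 0) + 1
        # Ink concentrated in few rows gives a large sum of squared counts.
        score = sum(count * count for count in histogram.values())
        if score > best_score:
            best_angle, best_score = angle, score
    return best_angle


def rotate(points, angle):
    """Rotate (x, y) points by `angle` degrees around the origin."""
    theta = math.radians(angle)
    return [(x * math.cos(theta) - y * math.sin(theta),
             x * math.sin(theta) + y * math.cos(theta)) for x, y in points]


# Synthetic page: three horizontal text lines, then skewed by 5 degrees.
straight = [(x, y) for y in (0, 10, 20) for x in range(100)]
skewed = rotate(straight, 5)
print(estimate_skew(skewed, range(-10, 11)))
```

On real scans the profile is much noisier, so this kind of search is usually only a starting point.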

  • Are there algorithms which reliably detect lines / text blocks / tables / images (hence make a reasonable segmentation)? I guess edge detection with smoothing and line-wise histograms already works reasonably well for that?

OpenCV implements Maximally Stable Extremal Regions (MSER). I really like the results of this algorithm: it is fast and was good enough for me when I needed it.
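The line-wise histogram idea mentioned in the question can be sketched in a few lines of plain Python. This is only an illustration under the assumption that the page is already binarized (1 for ink, 0 for background) and not rotated; find_text_lines is a hypothetical helper, not part of OpenCV or Keras.

```python
def find_text_lines(image, min_ink=1):
    """Return (top, bottom) row ranges that contain ink; bottom is exclusive.

    `image` is a list of rows; each pixel is 1 for ink and 0 for background.
    """
    # Line-wise histogram: number of ink pixels in each row.
    profile = [sum(row) for row in image]
    lines = []
    start = None
    for y, ink in enumerate(profile):
        if ink >= min_ink and start is None:
            start = y                  # a text line begins
        elif ink < min_ink and start is not None:
            lines.append((start, y))   # the line ended on the previous row
            start = None
    if start is not None:              # a line touching the bottom edge
        lines.append((start, len(profile)))
    return lines


# Toy page: two "text lines" (rows 1-2 and row 4) separated by a blank row.
page = [
    [0, 0, 0, 0, 0, 0],
    [0, 1, 1, 0, 1, 0],
    [0, 1, 0, 1, 1, 0],
    [0, 0, 0, 0, 0, 0],
    [0, 1, 1, 1, 0, 0],
]
print(find_text_lines(page))  # prints [(1, 3), (4, 5)]
```

Raising min_ink makes the split less sensitive to speckle noise, at the cost of possibly cutting off thin strokes such as accents or descenders.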

As I said before, I will release the code soon. I will edit the question with the repository when I do, but I believe the information here is enough to get the example running.

answered Oct 02 '22 20:10 by Claudio
