I am working on a project where I process and store single-page medical reports with labelled categories. The user will input one document and I have to classify which category it belongs to.
I have converted all documents to grayscale image format and stored them for comparison purposes.
I have a dataset of images with the following columns:

image_path: the path to the image.
histogram_value: the histogram of the image, calculated using the cv2.calcHist function.
np_avg: the average value of all pixels of the image, calculated using np.average.
category: the category of the image.

I am planning to use these two methods:

Using the histogram_value of the input image, find the nearest 10 matching images.
Using the np_avg of the input image, find the nearest 10 matching images.
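Roughly, what I have in mind for the two lookups is something like this (df is the dataset described above; the function names are just illustrative):

import cv2
import pandas as pd

def nearest_by_histogram(df, input_hist, k=10):
    # correlation method: a higher score means more similar histograms
    scores = df['histogram_value'].apply(
        lambda h: cv2.compareHist(h, input_hist, cv2.HISTCMP_CORREL))
    return df.loc[scores.nlargest(k).index]

def nearest_by_avg(df, input_avg, k=10):
    # nearest neighbours by absolute difference of mean pixel value
    diffs = (df['np_avg'] - input_avg).abs()
    return df.loc[diffs.nsmallest(k).index]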
I have very little knowledge of the image-processing domain. Will the above mechanism be reliable for my purpose?
I checked SO and found a few questions on the same topic, but they have very different problems and desired outcomes. This question looks similar to my situation, but it's very generic and I am not sure it will work in my scenario.
Link to sample reports
I'd recommend a few things:
Text Based Comparison
OCR the documents and extract text features using Google's Tesseract, which is one of the best open-source OCR packages out there. There is also a Python wrapper for it called PyTesseract. You'll likely need to play with the resolution of your images for the OCR to work to your satisfaction - this will require some trial and error.
Once you have extracted the words, one of the commonly accepted approaches is to calculate TF-IDF (Term Frequency - Inverse Document Frequency) weights and then use a distance-based approach (cosine similarity is one of the common ones) to compare which documents are "similar" (closer) to each other.
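A minimal sketch of that pipeline, using PyTesseract plus scikit-learn (the image file names are placeholders):

import pytesseract
from PIL import Image
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

paths = ['report_a.png', 'report_b.png', 'report_c.png']  # placeholder files
# OCR each report image to plain text
texts = [pytesseract.image_to_string(Image.open(p)) for p in paths]

# TF-IDF vectors, then pairwise cosine similarity between documents
vectors = TfidfVectorizer(stop_words='english').fit_transform(texts)
similarity = cosine_similarity(vectors)  # similarity[i][j] near 1 = very similar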
Image Based Comparison
If you already have the images as vectors, then apply a distance-based measure to figure out similarity. Generally, the L1 or L2 norm would work. This paper suggests that Manhattan distance (L1 norm) might work better for natural images. You could start with that and try other distance-based measures.
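For example, with NumPy, assuming each image has already been flattened into a vector of equal length:

import numpy as np

doc_a = np.random.rand(10000)  # placeholder flattened grayscale images
doc_b = np.random.rand(10000)

l1 = np.linalg.norm(doc_a - doc_b, ord=1)  # Manhattan (L1) distance
l2 = np.linalg.norm(doc_a - doc_b)         # Euclidean (L2) distance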
Ensemble Text and Image Based Comparisons
Run both approaches and then average the results of the two to arrive at documents that are similar to each other.
For example:
The text-based approach might rank DocB and DocC as the two documents closest to DocA, at distances of 10 and 20 units respectively.
The image-based approach might rank DocC and DocB as the two closest, at distances of 5 and 20 units respectively.
Then you can average the two distances: DocB would be (10+20)/2 = 15 units and DocC would be (20+5)/2 = 12.5 units from DocA. So in the ensembled approach you'll treat DocC as closer to DocA than DocB.
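In code, that averaging is straightforward (using the numbers from the example above):

text_dist  = {'DocB': 10, 'DocC': 20}   # distances from DocA, text approach
image_dist = {'DocB': 20, 'DocC': 5}    # distances from DocA, image approach

ensemble = {doc: (text_dist[doc] + image_dist[doc]) / 2 for doc in text_dist}
ranking = sorted(ensemble, key=ensemble.get)  # ['DocC', 'DocB']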
Measuring the similarity of documents from images is more complicated than measuring the similarity of documents from text.
My solution is to use machine learning to find a representation of a document and use this representation to classify the document. Here I will give a Keras implementation of the solution I propose.
I propose using convolutional layers for feature extraction followed by recurrent layers for sequence classification. I've chosen Keras because of my familiarity with it, and because it has a simple API for defining a network that combines convolutional and recurrent layers. But the code can easily be ported to other libraries such as PyTorch or TensorFlow.
There are many ways to pre-process images of documents for neural networks. I'm making the following assumption.
Split the images vertically so that the lines are fed as sequences (it is more efficient if the split can be made on an empty line). I will show this using NumPy for a single document. In the following implementation, I assume the image shape of a single document is (100, 100, 3). First, let's define image_shape, the shape of the document images:
import numpy as np

image_shape = (100, 100, 3)
split_size = 25  # this should be a factor of image_shape[0]

doc_images = []  # collects the strips of every document
doc_image = np.zeros(image_shape)  # stand-in for a real document image

# split the document into image_shape[0] // split_size equal strips
splitted_images = np.split(doc_image, image_shape[0] // split_size, axis=0)
doc_images.extend(splitted_images)
doc_images = np.array(doc_images)
Keras has a ConvLSTM2D layer to deal with sequential images. The input to the network is a list of sequences of images produced by splitting the document images.
from keras.models import Sequential
from keras.layers import ConvLSTM2D, Dense, Flatten

num_classes = 10

model = Sequential()
model.add(ConvLSTM2D(32, (3, 3),
                     input_shape=(None, split_size, image_shape[1], image_shape[2]),
                     padding='same',
                     return_sequences=True))
model.add(ConvLSTM2D(32, (3, 3), padding='same', return_sequences=True))
# the last recurrent layer returns only its final state
model.add(ConvLSTM2D(32, (3, 3), padding='same', return_sequences=False))
model.add(Flatten())
model.add(Dense(1024, activation="relu"))
model.add(Dense(num_classes, activation="softmax"))
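To train it, something along these lines should work; X_train and y_train are hypothetical placeholders, with X_train shaped (num_docs, num_splits, split_size, image_shape[1], image_shape[2]) and y_train one-hot encoded labels:

model.compile(loss='categorical_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])
# model.fit(X_train, y_train, batch_size=8, epochs=10)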
Ideally this model will work because the model might learn a hierarchical representation (characters, words, sentences, contexts, symbols) of the document from its image.
The sample documents vary so greatly that it is impossible to compare them at the image level (histogram, np_avg).
The contents of the reports are multiple numeric results (min, max, recommended) or category results (Negative/Positive).
For each type of report, you'd have to do separate preprocessing.
If the document source is digital (not scanned), you do extraction and comparison of fields and rows - each row separately.
If the documents are scanned, you have to deal with image rotation, quality and artifacts before extraction.
Each type of report is a problem on its own. Pick one type of report with multiple samples for a start.
Since you are dealing with numbers, you'll only get good results by extracting to text and numbers. If a report says a value is 0.2 and the tolerated range is between 0.1 and 0.3, a NN is not the tool for that. You have to compare the numbers.
NNs are not the best tool for this, at least not for comparing values - maybe for part of the extraction process.
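For example, once a row has been extracted to text, the check is a plain number comparison (the line below is a made-up example):

import re

line = "Hemoglobin: 0.2 (reference range 0.1 - 0.3)"  # made-up OCR output
value, low, high = map(float, re.findall(r'\d+\.\d+', line))
in_range = low <= value <= high  # True - a plain comparison, no NN needed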