 

Finding Similar Document

I am working on a project where I have processed and stored documents of single-page medical reports with labelled categories. The user will input one document and I have to classify which category it belongs to.

I have converted all documents to grayscale images and stored them for comparison purposes.

I have a dataset of images with the following data:

  • image_path: This column has the path to the image.
  • histogram_value: This column has the histogram of the image, calculated using the cv2.calcHist function.
  • np_avg: This column has the average value of all pixels of the image, calculated using np.average.
  • category: This column is the category of the image.

I am planning to use these two values in the following way (a rough sketch of this pipeline is shown after the list):

  • Calculate the histogram_value of the input image and find the 10 nearest matching images.
  • Calculate the np_avg of the input image and find the 10 nearest matching images.
  • Take the intersection of both result sets.
  • If more than one image is found, do template matching to find the best fit.
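
For reference, a minimal sketch of what that pipeline could look like, assuming the dataset is a pandas DataFrame with the columns listed above, the histograms are stored as arrays, and all images are grayscale and of the same size (the helper functions here are illustrative, not an existing API):

import cv2
import numpy as np

def nearest_by_histogram(df, query_hist, k=10):
    # cv2.compareHist with HISTCMP_CORREL: higher score = more similar.
    scores = df["histogram_value"].apply(
        lambda h: cv2.compareHist(np.float32(h), np.float32(query_hist),
                                  cv2.HISTCMP_CORREL))
    return set(scores.nlargest(k).index)

def nearest_by_avg(df, query_avg, k=10):
    return set((df["np_avg"] - query_avg).abs().nsmallest(k).index)

def classify(df, query_img):
    # df is the pandas DataFrame described above; query_img is a grayscale image.
    query_hist = cv2.calcHist([query_img], [0], None, [256], [0, 256])
    query_avg = np.average(query_img)
    candidates = nearest_by_histogram(df, query_hist) & nearest_by_avg(df, query_avg)
    best_idx, best_score = None, -np.inf
    for idx in candidates:
        stored = cv2.imread(df.loc[idx, "image_path"], cv2.IMREAD_GRAYSCALE)
        # Template matching; assumes stored and query images have equal size.
        score = cv2.matchTemplate(stored, query_img, cv2.TM_CCOEFF_NORMED).max()
        if score > best_score:
            best_idx, best_score = idx, score
    return df.loc[best_idx, "category"] if best_idx is not None else None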

I have very little knowledge in the image processing domain. Will the above mechanism be reliable for my purpose?

I checked SO and found a few questions on the same topic, but they have very different problems and desired outcomes. This question looks similar to my situation, but it's very generic and I am not sure it will work in my scenario.

Link to sample reports

— Gaurav Gandhi, asked Dec 24 '18


3 Answers

I'd recommend a few things:

Text Based Comparison:

OCR the documents and extract text features using Google's Tesseract, which is one of the best open-source OCR packages out there. There is also a Python wrapper for it called PyTesseract. You'll likely need to play with the resolution of your images for the OCR to work to your satisfaction - this will require some trial and error.

Once you have extracted the words, one of the commonly accepted approaches is to calculate TF-IDF (Term Frequency - Inverse Document Frequency) weights and then use a distance-based approach (cosine similarity is one of the common ones) to compare which documents are "similar" (closer) to each other.
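
A minimal sketch of this idea, assuming pytesseract and scikit-learn are installed (the file names here are just placeholders):

import pytesseract
from PIL import Image
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Placeholder paths; replace with your stored report images.
paths = ["report_a.png", "report_b.png", "report_c.png"]

# OCR each report image into plain text.
texts = [pytesseract.image_to_string(Image.open(p)) for p in paths]

# TF-IDF vectors for every document, then pairwise cosine similarity
# (higher value = more similar documents).
vectors = TfidfVectorizer(stop_words="english").fit_transform(texts)
similarity = cosine_similarity(vectors)
print(similarity)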

Image Based Comparison

If you already have the images as vectors, then apply a distance-based measure to figure out similarity. Generally the L1 or L2 norm would work. This paper suggests that Manhattan distance (L1 norm) might work better for natural images. You could start with that and try other distance-based measures.
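
As a rough sketch, ranking stored image vectors by distance to a query (the array shapes here are arbitrary placeholders):

import numpy as np

# Flattened grayscale images: one query document and a stack of stored ones.
query = np.random.rand(100 * 100)
stored = np.random.rand(500, 100 * 100)

l1 = np.abs(stored - query).sum(axis=1)            # Manhattan / L1 distance
l2 = np.sqrt(((stored - query) ** 2).sum(axis=1))  # Euclidean / L2 distance

# Indices of the 10 closest stored images under the L1 norm.
closest = np.argsort(l1)[:10]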

Ensemble Text and Image Based Comparisons

Run both approaches and then average the two to arrive at documents that are similar to each other.

For example:

The text-based approach might rank DocB and DocC as the closest two documents to DocA, at distances of 10 and 20 units respectively.

The image-based approach might rank DocC and DocB as the closest two, at distances of 5 and 20 units respectively.

Then you can average the two distances: DocB would be (10+20)/2 = 15 units and DocC would be (20+5)/2 = 12.5 units apart from DocA. So you'll treat DocC as closer to DocA than DocB in the ensembled approach.
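
A tiny sketch of that averaging, with the illustrative distances above hard-coded:

# Distances from DocA under each approach (the example values above).
text_dist = {"DocB": 10, "DocC": 20}
image_dist = {"DocB": 20, "DocC": 5}

ensemble = {doc: (text_dist[doc] + image_dist[doc]) / 2 for doc in text_dist}
ranking = sorted(ensemble, key=ensemble.get)
print(ensemble)  # {'DocB': 15.0, 'DocC': 12.5}
print(ranking)   # ['DocC', 'DocB']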

— HakunaMaData, answered Oct 31 '22


Measuring the similarity of documents from images is complicated compared to measuring it from text, for two reasons.

  1. The images could have similarities in terms of brightness, textual content, diagrams or symbols.
  2. It is often harder to find a representation of a document from the images it contains than from its textual information.

Solution

My solution uses machine learning to find a representation of a document and uses this representation to classify the document. Here I will give a Keras implementation of the solution I propose.

Network type

I propose using convolutional layers for feature extraction, followed by recurrent layers for sequence classification. I've chosen Keras because of my familiarity with it and because it has a simple API for defining a network with a combination of convolutional and recurrent layers. But the code can easily be changed to other libraries such as PyTorch, TensorFlow, etc.

Images pre-processing

There are many ways to pre-process the images of documents for neural networks. I'm making the following assumptions:

  • Images contain horizontal text rather than vertical text.
  • The document image size is fixed. If the image size is not fixed, the images can be resized using OpenCV's resize method.

Split the images vertically so that the lines are fed as a sequence (it is more efficient if the split can be made on an empty line). I will show this using NumPy for a single document. In the following implementation, I assume the image shape of a single document is (100, 100, 3). First, let's define image_shape, the shape of the document images, as

import numpy as np

image_shape = (100, 100, 3)
split_size = 25  # this should be a factor of image_shape[0]
doc_images = []  # horizontal strips of a single document
doc_image = np.zeros(image_shape)

# Split the document into image_shape[0] // split_size equal strips,
# each of shape (split_size, image_shape[1], image_shape[2]).
splitted_images = np.split(doc_image, image_shape[0] // split_size, axis=0)
doc_images.extend(splitted_images)
doc_images = np.array(doc_images)  # shape: (4, 25, 100, 3)

The network implementation

Keras has a ConvLSTM2D layer to deal with sequences of images. The input to the network is a list of sequences of images produced by splitting the document images.

from keras.models import Sequential
from keras.layers import ConvLSTM2D, Dense, Flatten

num_of_classes = 10
model = Sequential()

# Each sample is a sequence of image strips:
# (time_steps, rows, cols, channels) = (None, split_size, 100, 3)
model.add(ConvLSTM2D(32, (3, 3),
        input_shape=(None, split_size, image_shape[1], image_shape[2]),
        padding='same',
        return_sequences=True))
model.add(ConvLSTM2D(32, (3, 3), padding='same', return_sequences=True))
model.add(ConvLSTM2D(32, (3, 3), padding='same', return_sequences=False))
model.add(Flatten())
model.add(Dense(1024, activation="relu"))
model.add(Dense(num_of_classes, activation="softmax"))

Ideally this model will work because the model might learn a hierarchical representation (characters, words, sentences, contexts, symbols) of the document from its image.
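
As a rough sketch of how training could look, continuing from the snippets above (the number of documents and the labels here are dummy placeholders, not from the question):

from keras.utils import to_categorical

num_docs = 32  # dummy number of training documents
# Each document is a sequence of image_shape[0] // split_size strips.
X = np.zeros((num_docs, image_shape[0] // split_size, split_size,
              image_shape[1], image_shape[2]))
y = to_categorical(np.random.randint(0, num_of_classes, size=num_docs),
                   num_classes=num_of_classes)

model.compile(optimizer="adam", loss="categorical_crossentropy",
              metrics=["accuracy"])
model.fit(X, y, batch_size=8, epochs=10)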

— Mitiku, answered Oct 31 '22


The sample documents vary greatly, so it is impossible to compare them at the image level (histogram, np_avg).

The contents of the reports are multiple numeric results (min, max, recommended) or categorical results (Negative/Positive).

For each type of report, you'd have to do preprocessing.

If the document source is digital (not scanned), you do extraction and comparison of fields and rows, each row separately:

  • extraction of the image part of a field or row and comparing it with a NN
  • extraction to text and comparing the values (OCR)

If the documents are scanned, you have to deal with image rotation, quality and artifacts before extraction.

Each type of report is a problem on its own. Pick one type of report with multiple samples to start.

Since you are dealing with numbers, you will only get good results with extraction to text and numbers. If a report says the value is 0.2 and the tolerated range is between 0.1 and 0.3, a NN is not the tool for that. You have to compare the numbers.

NNs are not the best tool for this, at least not for comparing values. Maybe for part of the extraction process.
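
For the value comparison itself, a minimal sketch of such a rule check, with the extracted text and the tolerated range from the example above hard-coded as placeholders:

import re

# Placeholder for a line extracted by OCR from one report row.
extracted_line = "Result: 0.2 (tolerated range 0.1 - 0.3)"

# Pull the numbers out of the line: the value first, then the range bounds.
value, low, high = map(float, re.findall(r"\d+(?:\.\d+)?", extracted_line))

print("within range" if low <= value <= high else "out of range")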

Steps to solution

  • automate categorization of reports
  • for each type of report mark fields with data
  • for each type of report automate extraction of values
  • for each type of report interpret values according to business rules

— dario, answered Oct 31 '22