How to recognize entities in text that is the output of optical character recognition (OCR)?

Question

I am trying to do multi-class classification with textual data. Problem I am facing that I have unstructured textual data. I'll explain the problem with an example. consider this image for example:

example data

I want to extract and classify text information given in image. Problem is when I extract information OCR engine will give output something like this:

18
EURO 46
KEEP AWAY
FROM FIRE
MADE IN CHINA
2226249917581
7412501
DOROTHY
PERKINS

Now target classes here are:

18 -> size
EURO 46 -> price
KEEP AWAY FROM FIRE -> usage_instructions
MADE IN CHINA -> manufacturing_location
2226249917581 -> product_id
7412501 -> style_id
DOROTHY PERKINS -> brand_name

Problem I am facing is that input text is not separable, meaning "multiple lines can belong to same class" and there can be cases where "single line can have multiple classes".

So I don't know how I can split/merge lines before passing it to classification model.
Is there any way using NLP I can split paragraph based on target class. In other words given input paragraph split it based on target labels.

amirouche · Accepted Answer

If you only consider the text, this is a Named Entity Recognition (NER) task.

What you can do is train a Spacy model to NER for your particular problem.

Here is what you will need to do:

First gather a list of training text data
Label that data with corresponding entity types
Split the data into training set and testing set
Train a model with Spacy NER using training set
Score the model using the testing set
...
Profit!

See Spacy documentation on training specific NER models

Good luck!

How to recognize entities in text that is the output of optical character recognition (OCR)?

Tags:

nlp

recurrent-neural-network

named-entity-recognition

text-classification

named-entity-extraction

Vivek Mehta

1 Answers

amirouche

Recent Activity

Donate For Us

How to recognize entities in text that is the output of optical character recognition (OCR)?

Tags:

nlp

recurrent-neural-network

named-entity-recognition

text-classification

named-entity-extraction

Vivek Mehta

1 Answers

amirouche

Related questions

Recent Activity

Donate For Us