
Machine Learning - Information extraction from a document [closed]

I'm trying to train a couple of neural networks (using TensorFlow) to extract semantic information from invoices. After a lot of reading I came up with this:

  • Use word2vec to generate word embeddings (more on the corpus below).
  • Feed the output of word2vec to a CNN since vectors that are close together share similar semantic meanings.

So the very high level approach I described above seems quite alright to me. I would love for it to be corrected if anything looks wrong.

A couple of concerns that I have:

  1. Corpus selection. Is it sufficient to use a generic corpus such as Wikipedia? Or should I use a specialized corpus for invoices? If it's the latter, how can I generate this corpus? I do have a big dataset of invoices that I can use.
  2. Information extraction. Let's say all of the above works fine and I'm able to understand semantic information from a new, unseen invoice. How do I go about extracting certain pieces of information? For instance, let's say we introduce a new invoice that has order number: 12345. Assuming order number is understood to be the invoice number (or whatever vectors lie in the vicinity of order number), how do I extract the value 12345? One area I was looking at is SyntaxNet, which could help here.

Any help/insight is appreciated.

Follow up to @wasi-ahmad's question: The reason I'm trying to understand semantic information about an invoice is to ultimately be able to extract values out of it. So, for instance, if I present an unseen invoice to my neural network it would find the invoice's number (whatever its label is called) and extract its value.

Aziz Alfoudari asked Nov 22 '16 21:11



2 Answers

  1. If you have a big dataset of invoices, it's better to use that. The dataset has an obvious impact on how the word embeddings are constructed. To construct the corpus, you can remove common stop words (like a, the, etc.) and then use the tf-idf weight of each word to represent a document before feeding it to a skip-gram or CBOW model. You can also use one-hot encoding as an alternative to tf-idf weights. You can also consider a simple language model (using bigrams or trigrams), since you have a very specific domain to work on. This would make your model simpler!

  2. Your second concern is not clear to me! Usually, numeric values are replaced by some label, say NUM, during the pre-processing step of an information extraction task. However, SyntaxNet is actually for dependency parsing. Since your ultimate goal is to extract semantic meaning from invoices, why do you need syntactic information? Is it going to help you in this task? If you have a large dataset, you can generate a dictionary for the specific target domain. But it depends on how you are going to use the extracted semantic information, which you didn't mention in your post!
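To make the pre-processing ideas above a bit more concrete, here is a minimal sketch using only the Python standard library. It drops stop words, replaces numeric values with a NUM label, and counts bigrams as the raw material for a simple language model. The stop-word list, the token pattern, and the sample invoice lines are all made up for illustration, so treat them as assumptions rather than a recommended setup.

```python
import re
from collections import Counter

# Assumption: a tiny illustrative stop-word list; a real one would be larger.
STOP_WORDS = {"a", "an", "the", "of", "to", "is", "for", "and"}

# Words, or numbers that may contain thousands/decimal separators.
TOKEN_RE = re.compile(r"[A-Za-z]+|\d+(?:[.,]\d+)*")


def preprocess(text):
    """Lowercase words, drop stop words, replace numeric tokens with NUM."""
    tokens = []
    for tok in TOKEN_RE.findall(text):
        if tok[0].isdigit():
            tokens.append("NUM")
        elif tok.lower() not in STOP_WORDS:
            tokens.append(tok.lower())
    return tokens


def bigram_counts(documents):
    """Count adjacent token pairs across all pre-processed documents."""
    counts = Counter()
    for doc in documents:
        tokens = preprocess(doc)
        counts.update(zip(tokens, tokens[1:]))
    return counts


# Hypothetical invoice lines, invented for this example.
invoices = [
    "Order number: 12345",
    "Invoice number: 98765",
    "Total amount due: 1,250.00",
]
counts = bigram_counts(invoices)
```

With counts like these, a pair such as ("number", "NUM") showing up often is a cheap signal that the word before a numeric value acts as its label, which may already be enough for a narrow domain like invoices.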

This is my personal opinion (not aiming to criticize you): using word embeddings or neural network based models everywhere is not feasible. Word embedding or neural network based approaches give good performance at the cost of heavy computation. So, if you can serve your purpose with a simple and efficient model, why would you prefer a complex and computationally expensive one? You must have very good reasons for your chosen model. It is not wise to use a model only because it is popular and widely used.

Wasi Ahmad answered Oct 16 '22 19:10


I am assuming this is a straightforward extraction problem for invoices. You are proposing a far more complex solution than is probably needed. I don't really see how it could work, but I don't know everything. Let's step back and start simple:

1) Take at least one example of each type of invoice you expect to process and mark it up with XML-like tags that mirror the extraction goal, e.g. "Order Number: 12445". An XML or other parser can grab it later for the evaluation step, or for post-processing, if needed.

2) Think of the simplest way to extract the information--I suggest you start with a regular expression matcher.

3) If a regex matcher is insufficient then you may need some supervised machine learning. This will be able to get more varied phrases and can perform at a very high level of precision and recall for the right phrases. See http://alias-i.com/lingpipe/demos/tutorial/ne/read-me.html for a bunch of approaches.

4) If you need more than phrase matching, e.g. part number and part count, then you may need to top the stack with a classifier that decides whether the co-occurrence is legit or not.
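Steps 1 and 2 can be sketched roughly as follows, using only the Python standard library. The tag names (`invoice`, `order_number`, `date`), the sample invoice, and the set of labels the regex accepts are all assumptions made up for this example. Your own invoices will dictate the real ones.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical gold markup for one invoice; the tag names are assumptions.
GOLD_XML = """<invoice>
Order Number: <order_number>12445</order_number>
Date: <date>2016-11-22</date>
</invoice>"""

# Assumption: these label variants all mean "order/invoice number".
ORDER_NUMBER_RE = re.compile(
    r"\b(?:order|invoice)\s*(?:number|no\.?|#)\s*[:#]?\s*(\d+)",
    re.IGNORECASE,
)


def extract_order_number(text):
    """Return the digits following an order/invoice number label, or None."""
    match = ORDER_NUMBER_RE.search(text)
    return match.group(1) if match else None


def gold_and_plain_text(xml_string):
    """Split a marked-up invoice into gold values and tag-free text."""
    root = ET.fromstring(xml_string)
    gold = {child.tag: child.text for child in root}
    plain = "".join(root.itertext())
    return gold, plain


# Evaluation step: run the matcher on the plain text, compare with the markup.
gold, plain = gold_and_plain_text(GOLD_XML)
predicted = extract_order_number(plain)
is_correct = predicted == gold["order_number"]
```

Running this over every marked-up example gives a simple accuracy figure per field. If the regex starts missing too many label variants, that is the point at which step 3 is worth the effort.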

Hope that helps.

Breck

Breck Baldwin answered Oct 16 '22 19:10