I'm trying to train a couple of neural networks (using TensorFlow) to extract semantic information from invoices. After a lot of reading, I came up with this approach: feed word2vec embeddings to a CNN, since vectors that are close together share similar semantic meanings. That very high-level approach seems reasonable to me, but I would love for it to be corrected if anything looks wrong.
A couple of concerns that I have: given a line like "order number: 12345", and assuming "order number" is understood to be the invoice number (or whatever vectors lie in the vicinity of "order number"), how do I extract the value 12345? One area I was looking at is SyntaxNet, which could help here. Any help/insight is appreciated.
Follow-up to @wasi-ahmad's question: the reason I'm trying to understand semantic information in an invoice is ultimately to extract values from it. So, for instance, if I present an unseen invoice to my neural network, it would find the invoice's number (whatever its label is called) and extract its value.
Information extraction is the process of extracting information from unstructured textual sources to enable finding entities as well as classifying and storing them in a database.
Information extraction is concerned with applying natural language processing to automatically extract the essential details from text documents. A great disadvantage of current approaches is their intrinsic dependence on the application domain and the target language.
Extraction means “pulling out” and Retrieval means “getting back.” Information retrieval is about returning the information that is relevant for a specific query or field of interest of the user.
If you have a big dataset of invoices, it's better to use that; the dataset has an obvious impact on word-embedding construction. To construct the corpus, you can remove common stop words (like "a", "the", etc.) and then use the tf-idf weight of each word to represent a document before feeding them to a skip-gram or CBOW model. You can also use one-hot encoding as an alternative to tf-idf weights. You can also think about a simple language model (using bigrams or trigrams), since you have a very specific domain to work on. This would make your model simpler!
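As a minimal sketch of that preprocessing idea (stop-word removal plus tf-idf weighting) in plain Python -- the stop-word list and invoice strings below are made up for illustration, and the resulting weights are what you would then feed into a skip-gram/CBOW pipeline:

```python
import math
from collections import Counter

# Tiny illustrative stop-word list; a real pipeline would use a fuller one.
STOP_WORDS = {"a", "an", "the", "of", "to", "is"}

def tokenize(text):
    return [w for w in text.lower().split() if w not in STOP_WORDS]

def tfidf(docs):
    """Return one {token: tf-idf weight} dict per document."""
    tokenized = [tokenize(d) for d in docs]
    n = len(tokenized)
    df = Counter()                      # document frequency of each token
    for toks in tokenized:
        df.update(set(toks))
    vectors = []
    for toks in tokenized:
        tf = Counter(toks)
        vectors.append({w: (tf[w] / len(toks)) * math.log(n / df[w]) for w in tf})
    return vectors

docs = ["order number 12345", "the invoice number is 67890"]
vecs = tfidf(docs)
# "number" appears in every document, so its idf is log(2/2) = 0
print(vecs[0]["number"])  # 0.0
```

Tokens that occur in every invoice (like "number" here) get zero weight, which is exactly the down-weighting of boilerplate terms that tf-idf is meant to provide.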
Your second concern is not clear to me! Usually, numeric values are replaced by some label, say NUM, during the pre-processing step of an information-extraction task. However, SyntaxNet is actually for dependency parsing. Since your ultimate goal is to extract semantic meaning from invoices, why do you need syntactic information? Is it going to help you in this task? If you have a large dataset, you can generate a dictionary for the specific target domain. But it depends on how you are going to use the extracted semantic information, which you didn't mention in your post!
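For what it's worth, that NUM normalization is easy to sketch with a regular expression; the pattern below is a simple assumption, and real invoices may need currency- and date-aware variants:

```python
import re

def normalize_numbers(text):
    # Replace any run of digits (optionally with a decimal part) by a NUM
    # token, a common normalization step in information-extraction pipelines.
    return re.sub(r"\d+(?:\.\d+)?", "NUM", text)

print(normalize_numbers("order number: 12345, total: 99.00"))
# order number: NUM, total: NUM
```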
This is my personal opinion (not aiming to criticize you): using word embeddings or neural-network-based models everywhere is not always feasible. Word-embedding and neural-network-based approaches give good performance at the cost of heavy computational complexity. So, if you can serve your purpose with a simple and efficient model, why would you prefer a complex and computationally expensive one? You must have very good reasoning behind your chosen model. It is not a wise decision to use a model only because it is popular and widely used.
I am assuming this is a straightforward extraction problem for invoices. You are proposing a far more complex solution than is probably needed--I don't really see how it could work, but I don't know everything. Let's step back and start simple:
1) Take at least one example of each type of invoice you expect to process and mark it up with XML-like tags that mirror the goal extraction, e.g. "Order Number: 12345". An XML or other parser can grab it later for the evaluation step or for post-processing if needed.
2) Think of the simplest way to extract the information--I suggest you start with a regular-expression matcher.
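A hedged sketch of such a matcher -- the label variants in the pattern are assumptions you would grow as you see more invoice layouts:

```python
import re

# Hypothetical label variants for the invoice/order number field;
# extend the alternation as new invoice layouts turn up.
PATTERN = re.compile(
    r"(?:order|invoice)\s*(?:number|no\.?|#)\s*:?\s*(\d+)",
    re.IGNORECASE,
)

def extract_number(text):
    """Return the first matched number string, or None if no label matches."""
    m = PATTERN.search(text)
    return m.group(1) if m else None

print(extract_number("Order Number: 12345"))  # 12345
print(extract_number("Invoice No. 678"))      # 678
```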
3) If a regex matcher is insufficient, then you may need some supervised machine learning. This will be able to handle more varied phrases and can achieve a very high level of precision and recall for the right phrases. See http://alias-i.com/lingpipe/demos/tutorial/ne/read-me.html for a bunch of approaches.
4) If you need more than phrase matching--e.g. linking a part number to a part count--then you may need to top the stack with a classifier that decides whether the co-occurrence is legitimate or not.
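As a toy illustration of step 4 (not the LingPipe approach itself), a hand-rolled score over a candidate match's context might look like this; `looks_legit` and its feature cues are entirely hypothetical, and a real system would learn them with a supervised classifier over labeled candidates:

```python
# Toy validator for a candidate (label, value) pair produced by the matcher.
def looks_legit(label, value, context):
    score = 0
    if value.isdigit():                 # order/invoice numbers are digit runs
        score += 1
    if any(k in label.lower() for k in ("order", "invoice")):
        score += 1
    if "total" in context.lower():      # e.g. "Total: 99" is an amount, not an ID
        score -= 1
    return score >= 2

print(looks_legit("Order Number", "12345", "Order Number: 12345"))  # True
print(looks_legit("Total", "99", "Total: 99"))                      # False
```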
Hope that helps.
Breck