I'm trying to train a couple of neural networks (using TensorFlow) to extract semantic information from invoices. After a lot of reading, I came up with this approach: feed word2vec embeddings to a CNN, since vectors that are close together share similar semantic meanings. That very high-level approach seems reasonable to me, but I would love for it to be corrected if anything looks wrong.
A couple of concerns that I have: given a line like "order number: 12345", and assuming "order number" is understood to be the invoice number (or whatever vectors lie in the vicinity of "order number"), how do I extract the value 12345? One area I was looking at is SyntaxNet, which could help here. Any help/insight is appreciated.
Follow-up to @wasi-ahmad's question: the reason I'm trying to understand semantic information in an invoice is ultimately to extract values from it. So, for instance, if I present an unseen invoice to my neural network, it would find the invoice's number (whatever its label is called) and extract its value.
Information extraction is the process of extracting information from unstructured textual sources to enable finding entities as well as classifying and storing them in a database.
Information extraction is concerned with applying natural language processing to automatically extract the essential details from text documents. A great disadvantage of current approaches is their intrinsic dependence on the application domain and the target language.
Extraction means “pulling out” and Retrieval means “getting back.” Information retrieval is about returning the information that is relevant for a specific query or field of interest of the user.
If you have a big dataset of invoices, it's better to use that; the dataset has an obvious impact on word-embedding construction. To construct the corpus, you can remove common stop words (like "a", "the", etc.) and then use the tf-idf weight of each word to represent a document before feeding them to a skip-gram or CBOW model. You can also use one-hot encoding as an alternative to tf-idf weights. You can also think about a simple language model (using bigrams or trigrams), since you have a very specific domain to work on. This would make your model simpler!
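As a minimal sketch of that preprocessing idea (stop-word removal plus tf-idf weighting) in plain Python -- the stop-word list and invoice strings below are made up for illustration, and the resulting weights are what you would then feed into a skip-gram/CBOW pipeline:

```python
import math
from collections import Counter

# Tiny illustrative stop-word list; a real pipeline would use a fuller one.
STOP_WORDS = {"a", "an", "the", "of", "to", "is"}

def tokenize(text):
    return [w for w in text.lower().split() if w not in STOP_WORDS]

def tfidf(docs):
    """Return one {token: tf-idf weight} dict per document."""
    tokenized = [tokenize(d) for d in docs]
    n = len(tokenized)
    df = Counter()                      # document frequency of each token
    for toks in tokenized:
        df.update(set(toks))
    vectors = []
    for toks in tokenized:
        tf = Counter(toks)
        vectors.append({w: (tf[w] / len(toks)) * math.log(n / df[w]) for w in tf})
    return vectors

docs = ["order number 12345", "the invoice number is 67890"]
vecs = tfidf(docs)
# "number" appears in every document, so its idf is log(2/2) = 0
print(vecs[0]["number"])  # 0.0
```

Tokens that occur in every invoice (like "number" here) get zero weight, which is exactly the down-weighting of boilerplate terms that tf-idf is meant to provide.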
Your second concern is not clear to me! Usually, numeric values are replaced by some label, say NUM, during the pre-processing step of an information-extraction task. However, SyntaxNet is actually for dependency parsing. Since your ultimate goal is to extract semantic meaning from invoices, why do you need syntactic information? Is it going to help you in this task? If you have a large dataset, you can generate a dictionary for the specific target domain. But it depends on how you are going to use the extracted semantic information, which you didn't mention in your post!
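For what it's worth, that NUM normalization is easy to sketch with a regular expression; the pattern below is a simple assumption, and real invoices may need currency- and date-aware variants:

```python
import re

def normalize_numbers(text):
    # Replace any run of digits (optionally with a decimal part) by a NUM
    # token, a common normalization step in information-extraction pipelines.
    return re.sub(r"\d+(?:\.\d+)?", "NUM", text)

print(normalize_numbers("order number: 12345, total: 99.00"))
# order number: NUM, total: NUM
```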
This is my personal opinion (not aiming to criticize you): using word embeddings or neural-network-based models everywhere is not always feasible. Word-embedding and neural-network-based approaches give good performance at the cost of heavy computational complexity. So, if you can serve your purpose with a simple and efficient model, why would you prefer a complex and computationally expensive one? You must have very good reasoning behind your chosen model. It is not a wise decision to use a model only because it is popular and widely used.
I am assuming this is a straightforward extraction problem for invoices. You are proposing a far more complex solution than is probably needed--I don't really see how it could work, but I don't know everything. Let's step back and start simple:
1) Take at least one example of each type of invoice you expect to process and mark it up with XML-like tags that mirror the goal extraction, e.g. "Order Number: 12345". An XML or other parser can grab it later for the evaluation step or for post-processing if needed.
2) Think of the simplest way to extract the information--I suggest you start with a regular-expression matcher.
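A hedged sketch of such a matcher -- the label variants in the pattern are assumptions you would grow as you see more invoice layouts:

```python
import re

# Hypothetical label variants for the invoice/order number field;
# extend the alternation as new invoice layouts turn up.
PATTERN = re.compile(
    r"(?:order|invoice)\s*(?:number|no\.?|#)\s*:?\s*(\d+)",
    re.IGNORECASE,
)

def extract_number(text):
    """Return the first matched number string, or None if no label matches."""
    m = PATTERN.search(text)
    return m.group(1) if m else None

print(extract_number("Order Number: 12345"))  # 12345
print(extract_number("Invoice No. 678"))      # 678
```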
3) If a regex matcher is insufficient, then you may need some supervised machine learning. This will be able to handle more varied phrases and can achieve a very high level of precision and recall for the right phrases. See http://alias-i.com/lingpipe/demos/tutorial/ne/read-me.html for a bunch of approaches.
4) If you need more than phrase matching--e.g. linking a part number to a part count--then you may need to top the stack with a classifier that decides whether the co-occurrence is legitimate or not.
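As a toy illustration of step 4 (not the LingPipe approach itself), a hand-rolled score over a candidate match's context might look like this; `looks_legit` and its feature cues are entirely hypothetical, and a real system would learn them with a supervised classifier over labeled candidates:

```python
# Toy validator for a candidate (label, value) pair produced by the matcher.
def looks_legit(label, value, context):
    score = 0
    if value.isdigit():                 # order/invoice numbers are digit runs
        score += 1
    if any(k in label.lower() for k in ("order", "invoice")):
        score += 1
    if "total" in context.lower():      # e.g. "Total: 99" is an amount, not an ID
        score -= 1
    return score >= 2

print(looks_legit("Order Number", "12345", "Order Number: 12345"))  # True
print(looks_legit("Total", "99", "Total: 99"))                      # False
```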
Hope that helps.
Breck