We need to read some text from photos of sales receipts taken by iPad camera. Here is a sample similar to what we need to read from:

There are a few constraints to this problem:
This is what we have tried so far:
We are now thinking of creating a custom solution using convolutional NN. The question I have is how can we build a model that takes advantage of these two constraints to create a simpler and yet very accurate solution?
This is the general pipeline we have come up with so far.
We are not sure what else to do at this point. Any tips, advice and help will be great.
PS. I realize this is a question about design methodology and not a specific programming question. I apologize if this violates SO guidelines.
I propose for you to consider deeplearning4j.org solution. You can train their network on powerful machine and then save state of network and use it at android. Here they explained how to use their network at android app with help of java.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With