I am using the Google Vision API to perform text recognition on receipt images. I am getting some nice results, but the format of the response is quite unreliable. If there is a large gap between pieces of text, the readout prints the line below instead of the text next to it.
For example, with the following receipt image I get the response below:
4x Löwenbräu Original a 3,00 12,00 1
8x Weissbier dunkel a 3,30 26,401
3x Hefe-Weissbier a 3,30 9,90 1
1x Saft 0,25
1x Grosses Wasser
1x Vegetarische Varia
1x Gyros
1x Baby Kalamari Gefu
2x Gyros Folie
1x Schafskäse Ofen
1x Bifteki Metaxa
1x Schweinefilet Meta
1x St ifado
1x Tee
2,50 1
2,40 1
9,90 1
8,90 1
12,90
a 9,9019,80 1
6,90 1
11,90 1
13,90 1
14,90 1
2,10 1
Which starts off well and as expected, but then becomes fairly unhelpful when trying to connect prices to text. The ideal response would be as follows:
4x Löwenbräu Original a 3,00 12,00 1
8x Weissbier dunkel a 3,30 26,401
3x Hefe-Weissbier a 3,30 9,90 1
1x Saft 0,25 2,50 1
1x Grosses Wasser 2,40 1
1x Vegetarische Varia 9,90 1
1x Gyros 8,90 1
1x Baby Kalamari Gefu 12,90 1
2x Gyros Folie a 9,9019,80 1
1x Schafskäse Ofen 6,90 1
1x Bifteki Metaxa 11,90 1
1x Schweinefilet Meta 13,90 1
1x St ifado 14,90 1
1x Tee 2,10 1
Or close to that.
Is there a formatting request you can add to the API to get different responses? I have had success with Tesseract, where you can change the output format to achieve this result, and I was wondering if the Vision API has something similar.
I understand the API returns letter coordinates, which could be used, but I was hoping not to have to go into that kind of depth.
This might be a late answer, but I am adding it for future reference. For text that is very far apart, DOCUMENT_TEXT_DETECTION also does not provide proper line segmentation.
The following repository contains simple line-segmentation code based on the character polygon coordinates:
https://github.com/sshniro/line-segmentation-algorithm-to-gcp-vision
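The core idea can be sketched in a few lines: group words whose bounding polygons sit at roughly the same height, then sort each group left to right. This is my own minimal illustration, not the linked repository's code; the field names mimic the Vision API's `boundingBox`/`vertices` shape, and the pixel tolerance is an assumed value you would tune per image.

```python
# Minimal line-segmentation sketch from word bounding polygons.
# Assumptions: each word is {"text": ..., "boundingBox": {"vertices": [{x, y}, ...]}},
# and words on the same physical line have vertical midpoints within `tolerance` px.

def midpoint_y(box):
    """Average y of a bounding polygon's vertices."""
    ys = [v["y"] for v in box["vertices"]]
    return sum(ys) / len(ys)

def segment_lines(words, tolerance=10):
    """Group words into lines by vertical midpoint, then order each line by x."""
    groups = []  # list of (anchor_y, [words])
    for word in sorted(words, key=lambda w: midpoint_y(w["boundingBox"])):
        y = midpoint_y(word["boundingBox"])
        if groups and abs(groups[-1][0] - y) <= tolerance:
            groups[-1][1].append(word)   # same line: close enough vertically
        else:
            groups.append((y, [word]))   # start a new line
    lines = []
    for _, group in groups:
        # left-to-right order within the line, using the leftmost x of each word
        group.sort(key=lambda w: min(v["x"] for v in w["boundingBox"]["vertices"]))
        lines.append(" ".join(w["text"] for w in group))
    return lines
```

With this, a price printed far to the right of "1x Tee" but at the same height ends up on the same output line, which is exactly the pairing the raw response loses.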
You can add feature hints to your JSON request. For an image of a receipt like this, DOCUMENT_TEXT_DETECTION gives good results:
{
"requests": [
{
"image": {
"source": {
"imageUri": "https://i.stack.imgur.com/TRTXo.png"
}
},
"features": [
{
"type": "DOCUMENT_TEXT_DETECTION"
}
]
}
]
}
You can copy the above JSON and paste it into the Request Body field in the Try this API pane on the documentation page. Result:
4x LOwenbräu Original a 3,00 12,00 1
8x Weissbier dunkel a 3, 3026, 40 1
3x Hefe-Weissbier a 3,30990 1
1x Saft 0,25 2, 50 1
1x Grosses Wasser 2, 40 1
1x Vegetarische Varia 9,90 1
1x Gyros 8,90 1
1x Baby Kalamari Gefu 12,90 !
2x Gyros Folie a 9,9019, 80 1
1x Schaf skäse Ofen 6,90 1
1x Bifteki Metaxa 11,90 1
1x Schweinefilet Meta 13,90 1
1x Stifado 14, 90 1
1x Tee 2, 10 1
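To send the same request programmatically rather than through the Try this API pane, you can POST it to the `images:annotate` REST endpoint. This is a sketch using only the standard library; `YOUR_API_KEY` is a placeholder for a real Cloud Vision API key, and the response path `responses[0].fullTextAnnotation.text` assumes a successful DOCUMENT_TEXT_DETECTION result.

```python
import json
import urllib.request

API_KEY = "YOUR_API_KEY"  # placeholder: substitute your own Cloud Vision API key

def build_request(image_uri):
    """Build the same JSON body as shown above for a single remote image."""
    return {
        "requests": [
            {
                "image": {"source": {"imageUri": image_uri}},
                "features": [{"type": "DOCUMENT_TEXT_DETECTION"}],
            }
        ]
    }

def annotate(image_uri):
    """POST the request to images:annotate and return the recognized text."""
    url = "https://vision.googleapis.com/v1/images:annotate?key=" + API_KEY
    req = urllib.request.Request(
        url,
        data=json.dumps(build_request(image_uri)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["responses"][0]["fullTextAnnotation"]["text"]

# Usage (needs a valid key and network access):
# print(annotate("https://i.stack.imgur.com/TRTXo.png"))
```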
Google Vision is much less configurable than Tesseract at the moment. Since Google is behind both projects, guess which one is going to get higher priority in the future?