This is my first question on Stack Overflow. I would like to extract key-value pairs (FORMS) from a scanned PDF document via Amazon Textract. However, I have noticed that some key-value pairs returned by the web app demo (https://us-east-2.console.aws.amazon.com/textract/home?region=us-east-2#/demo) are missing from the results of the API methods I can call from code.
Furthermore, the two methods disagree with each other: the synchronous method (AnalyzeDocumentRequest), which does not accept PDF and forces me to convert the document to an image first, finds key-value pairs (Sync Result Example) that the asynchronous method does not (Async Result Example).
The problem is similar to the one described in this question, which also discusses the difference in results between the two methods of analyzing a document: AWS Textract - GetDocumentAnalysisRequest only returns correct results for first page of document
The code implementation is the same as in these examples:
Has anyone ever had the same problem?
We had this problem recently. The demo website provided by AWS found 50 fields, while our own code using the provided API yielded only 30.
After some trial and error and a lot of googling, we found that the response returned by GetDocumentAnalysisAsync included a NextToken, which is used to ask for more results. It turns out we had to call GetDocumentAnalysisAsync again with this token, and repeat until the response no longer included a NextToken.
At that point we knew we had all the data.
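The loop described above can be sketched roughly as follows. This is a minimal sketch in Python rather than the .NET SDK we used, assuming a boto3 Textract client whose get_document_analysis call is the counterpart of GetDocumentAnalysisAsync, and assuming the job has already finished. The helper names fetch_all_blocks and count_key_fields are made up for illustration.

```python
from typing import Any, Dict, List, Optional


def fetch_all_blocks(client: Any, job_id: str) -> List[Dict[str, Any]]:
    """Collect the Blocks from every result page of a finished analysis job.

    `client` is assumed to behave like boto3.client("textract"): it exposes
    get_document_analysis(JobId=..., NextToken=...) and returns a dict that
    may contain "Blocks" and "NextToken".
    """
    blocks: List[Dict[str, Any]] = []
    next_token: Optional[str] = None
    while True:
        kwargs: Dict[str, Any] = {"JobId": job_id}
        if next_token:
            # Only pass NextToken on follow-up calls.
            kwargs["NextToken"] = next_token
        response = client.get_document_analysis(**kwargs)
        blocks.extend(response.get("Blocks", []))
        next_token = response.get("NextToken")
        if not next_token:
            # No token in the response means this was the last page.
            break
    return blocks


def count_key_fields(blocks: List[Dict[str, Any]]) -> int:
    """Count form fields: KEY_VALUE_SET blocks whose EntityTypes include KEY."""
    return sum(
        1
        for block in blocks
        if block.get("BlockType") == "KEY_VALUE_SET"
        and "KEY" in block.get("EntityTypes", [])
    )
```

With a real client this would be called along the lines of fetch_all_blocks(boto3.client("textract"), job_id), where job_id comes from the earlier start_document_analysis call. Comparing count_key_fields on the first page alone against the full result should show whether missing pages explain the gap with the demo.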