AWS Textract - Group output lines by paragraph

I've started experimenting with AWS Textract, specifically with detect-document-text (docs: https://docs.aws.amazon.com/textract/latest/dg/detecting-document-text.html). As an example, take an image whose content is:

This is the first line
should continue here.

This is the second line.

detect-document-text returns JSON in which each block has a BlockType of WORD, LINE or PAGE. Other fields are attached as well, such as Relationships (which defines a type and a list of Ids), Geometry information (coordinates), Confidence, etc. In this case, the output contains one LINE block for each row (as expected), something like this:

{
...
  {
    ...
    "BlockType": "LINE",
    "Confidence": 97.8960189819336,
    "Text": "This is the first line",
    ...
  },
  {
    ...
    "BlockType": "LINE",
    "Confidence": 97.8960189819336,
    "Text": "should continue here.",
    ...
  },
  {
    ...
    "BlockType": "LINE",
    "Confidence": 97.8960189819336,
    "Text": "This is the second line.",
    ...
  },
  ...
}
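
For reference, here is a minimal sketch of pulling the LINE text out of such a response. It assumes the response has already been parsed into a Python dict with a top-level "Blocks" list (the shape boto3's detect_document_text returns); the function name line_texts and the hand-written sample_response are illustrative only, not real Textract output.

```python
def line_texts(response):
    """Return the Text of every LINE block, in the order the response lists them."""
    return [
        block["Text"]
        for block in response.get("Blocks", [])
        if block.get("BlockType") == "LINE"
    ]


# Hand-written stand-in for a parsed response (assumption: not real Textract output).
sample_response = {
    "Blocks": [
        {"BlockType": "PAGE"},
        {"BlockType": "LINE", "Confidence": 97.8960189819336,
         "Text": "This is the first line"},
        {"BlockType": "WORD", "Text": "This"},
        {"BlockType": "LINE", "Confidence": 97.8960189819336,
         "Text": "should continue here."},
        {"BlockType": "LINE", "Confidence": 97.8960189819336,
         "Text": "This is the second line."},
    ]
}

print(line_texts(sample_response))
```

This yields three separate strings, one per row, which is exactly the behaviour the question is about.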

My question is: is there a parameter that can be overridden (like a span value for rows or cells, to keep a single node per "sentence"), or some kind of option to group lines by paragraph (based on the calculated coordinates), so that I get full sentences? Or is this post-processing that has to be done on the client side? It seems like a common scenario, so I'm trying to find out whether it's already offered by Textract, or by some other AWS service that consumes the Textract output JSON.

Cesar A. Mostacero asked Nov 06 '25 06:11

1 Answer

As mentioned in syumaK's answer, this is not supported by the Textract API. Consider using an alternative service such as the Google Vision API, which often gives you whole paragraphs rather than just lines.

Alternatively, consider how text is normally laid out on a page. Lines that belong to the same paragraph tend to have similar widths and similar heights. Depending on the alignment used, they share a similar left, center or right x-location, and the separation between consecutive lines in the y-direction is generally less than 2 times the height of a line. You can limit the search to a single page at a time. If pages contain many lines, building a spatial search index such as an R-tree may improve the search speed.

No code, sorry, but that should form a pretty good skeleton for building out the line block aggregation function.

arkore answered Nov 09 '25 10:11
