My question is:
How can I extract text from a PDF file that is laid out in columns so that the result is separated by these columns?
Background: I work on a project about text analysis (especially of scientific texts). These texts are sometimes published in multi-column layouts, with each column given a separate page number. To order the extracted text by these printed page numbers, it would be useful to extract the text by columns.
I use pdfBox and tried / searched for several things:
- getThreadBeads() method of the PDPage class -> result: list with 0 size
- getCharactersByArticle() method -> text not divided in columns
The thing is that pdfBox seems to divide the text by columns automatically:
If I set setSortByPosition() of a PDFTextStripper to true, all characters of a page are output line by line across the whole page, without recognizing separate columns.
But if I set setSortByPosition() to false, the stripper does this division.
So I had a look at the pdfBox source code:
The crucial method is the writePage() method of PDFTextStripper.
This is evidently where spaces (which are not explicitly present in most PDFs) and line breaks are calculated.
But I couldn't find how the Stripper is calculating the column breaks.
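So the questions again:
- How is PDFTextStripper calculating column breaks?
- Are there methods in the pdfBox API to catch this / to extract the text by columns?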
Thanks in advance.
If I set setSortByPosition() of a PDFTextStripper to true, all characters of a page are output line by line across the whole page, without recognizing separate columns. But if I set setSortByPosition() to false, the stripper does this division.
[...] How is PDFTextStripper calculating column breaks?
It isn't.
By setting SortByPosition to false you tell PDFBox to not try to sort the text pieces from the page content stream but to instead accept them in the order they appear.
In your document the text pieces seem to be drawn in the reading order, i.e. column by column. This is not true for all documents, and to cope with other documents PDFBox offers the option of sorting the text pieces left-to-right, top-to-bottom.
Activating that option (setting SortByPosition to true) in your document returns the text without respect to the columns.
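To make the difference concrete, here is a minimal sketch that uses only the Java standard library. The `Piece` class and the sample coordinates are hypothetical stand-ins for positioned text pieces, not PDFBox types, and the plain top-to-bottom, left-to-right sort is a simplification of whatever PDFBox does internally. It only illustrates why position sorting interleaves the columns while content stream order keeps them apart.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Illustration only: Piece is a hypothetical stand-in, not a PDFBox class.
class SortOrderSketch {
    static final class Piece {
        final String text;
        final double x; // distance from the left page edge
        final double y; // distance from the top page edge (grows downward)

        Piece(String text, double x, double y) {
            this.text = text;
            this.x = x;
            this.y = y;
        }
    }

    // Corresponds to SortByPosition = false: keep the drawing order.
    static String inStreamOrder(List<Piece> pieces) {
        return join(pieces);
    }

    // Corresponds roughly to SortByPosition = true: top-to-bottom, then left-to-right.
    static String sortedByPosition(List<Piece> pieces) {
        List<Piece> copy = new ArrayList<>(pieces);
        copy.sort(Comparator.comparingDouble((Piece p) -> p.y)
                .thenComparingDouble(p -> p.x));
        return join(copy);
    }

    private static String join(List<Piece> pieces) {
        StringBuilder sb = new StringBuilder();
        for (Piece p : pieces) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(p.text);
        }
        return sb.toString();
    }

    // A two-column page whose text was drawn column by column.
    static List<Piece> twoColumnPage() {
        List<Piece> pieces = new ArrayList<>();
        pieces.add(new Piece("A1", 50, 100));  // left column, first line
        pieces.add(new Piece("A2", 50, 120));  // left column, second line
        pieces.add(new Piece("B1", 300, 100)); // right column, first line
        pieces.add(new Piece("B2", 300, 120)); // right column, second line
        return pieces;
    }

    public static void main(String[] args) {
        List<Piece> page = twoColumnPage();
        System.out.println(inStreamOrder(page));    // A1 A2 B1 B2
        System.out.println(sortedByPosition(page)); // A1 B1 A2 B2
    }
}
```

If a document draws its text in some other order, the unsorted output will not follow the columns either, which is why relying on SortByPosition = false is not a general solution.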
Are there methods in the pdfBox API to catch this / to extract the text by columns?
PDFBox does not analyze the page content to recognize columns. If you do the analysis yourself, though, it allows you to extract text column by column if you provide the column rectangles as regions.
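One possible way to do that analysis is sketched below, again with the Java standard library only. It assumes you have already collected the horizontal extent (left and right x) of each text piece on the page, for example from the text positions a stripper subclass sees. The method name `columnRegions`, the `minGap` threshold and the sample numbers are all hypothetical. The idea is to merge overlapping horizontal extents and treat every sufficiently wide vertical gap as a column boundary.

```java
import java.awt.geom.Rectangle2D;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Illustration only: a simple gap-based column detector, not part of PDFBox.
class ColumnRegionSketch {

    /**
     * spans      - one {left, right} pair per text piece
     * minGap     - smallest horizontal gap that counts as a column separator
     * pageHeight - height given to each resulting rectangle
     */
    static List<Rectangle2D> columnRegions(double[][] spans, double minGap, double pageHeight) {
        List<Rectangle2D> columns = new ArrayList<>();
        if (spans.length == 0) {
            return columns;
        }
        // Sort a copy by left edge so the caller's array order is untouched.
        double[][] sorted = spans.clone();
        Arrays.sort(sorted, Comparator.comparingDouble(s -> s[0]));

        double left = sorted[0][0];
        double right = sorted[0][1];
        for (int i = 1; i < sorted.length; i++) {
            if (sorted[i][0] - right >= minGap) {
                // Gap is wide enough: close the current column, start a new one.
                columns.add(new Rectangle2D.Double(left, 0, right - left, pageHeight));
                left = sorted[i][0];
                right = sorted[i][1];
            } else {
                // Still the same column: extend its right edge if needed.
                right = Math.max(right, sorted[i][1]);
            }
        }
        columns.add(new Rectangle2D.Double(left, 0, right - left, pageHeight));
        return columns;
    }

    public static void main(String[] args) {
        double[][] spans = { {50, 250}, {300, 500}, {60, 240}, {310, 480} };
        for (Rectangle2D r : columnRegions(spans, 20, 800)) {
            System.out.println(r.getX() + " .. " + r.getMaxX());
        }
    }
}
```

Each resulting rectangle could then be registered with a PDFTextStripperByArea via addRegion(name, rectangle), followed by extractRegions(page) and getTextForRegion(name) per column. Check which coordinate origin your PDFBox version expects for the rectangles, and note that a gap-based approach like this will be confused by full-width headings, figures or footnotes that span several columns.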