I'm using PDFBox to extract information from a PDF, and the information I'm currently trying to find is the x-position of the first character in each line. I can't find anything on how to get that information, though. I know PDFBox has a class called TextPosition, but I can't find out how to get a TextPosition object from the PDDocument either. How do I get the location information of a line of text from a PDF?
Try running "Preflight..." in Acrobat and choosing PDF Analysis -> List page objects, grouped by type of object. If you locate the text objects within the results list, you will notice there is a position value (in points) within the Text Properties -> * Font section.
Aside from performance, keep in mind that iTextPdf is licensed under the AGPL, which can be too restrictive. The README on GitHub explicitly mentions that if you use it, you should distribute your software under the AGPL or use a paid license. PDFBox, on the other hand, is licensed under the Apache License, which suits most cases.
To extract text (with or without extra information like positions, colors, etc.) using PDFBox, you instantiate a PDFTextStripper or a class derived from it and use it like this:
PDFTextStripper stripper = new PDFTextStripper();
String text = stripper.getText(document);
(A number of PDFTextStripper attributes allow you to restrict the pages from which text is extracted.)
During the execution of getText, the content streams of the pages in question (and those of form XObjects referenced from those pages) are parsed and text drawing commands are processed.
If you want to change the text extraction behavior, you have to change this text drawing command processing, which you most often should do by overriding this method:
/**
 * Write a Java string to the output stream. The default implementation will ignore the <code>textPositions</code>
 * and just calls {@link #writeString(String)}.
 *
 * @param text The text to write to the stream.
 * @param textPositions The TextPositions belonging to the text.
 * @throws IOException If there is an error when writing the text.
 */
protected void writeString(String text, List<TextPosition> textPositions) throws IOException
{
    writeString(text);
}
If you additionally need to know when a new line starts, you may also want to override this method:
/**
 * Write the line separator value to the output stream.
 * @throws IOException
 *             If there is a problem writing out the lineseparator to the document.
 */
protected void writeLineSeparator() throws IOException
{
    output.write(getLineSeparator());
}
writeString can be overridden to channel the text information into separate members (e.g. if you want a result in a more structured format than a mere String), or it can be overridden to simply add some extra information to the result String.
writeLineSeparator can be overridden to trigger some specific output between lines.
There are more methods which can be overridden but you are less likely to need them in general.
You wrote: "I'm using PDFBox to extract information from a pdf, and the information I'm currently trying to find is related to the x-position of the first character in the line."
This can be implemented as follows (simply adding the information at the start of each line):
PDFTextStripper stripper = new PDFTextStripper()
{
    // True until the first chunk of the current line has been written.
    boolean startOfLine = true;

    @Override
    protected void startPage(PDPage page) throws IOException
    {
        // Every page begins with a new line.
        startOfLine = true;
        super.startPage(page);
    }

    @Override
    protected void writeLineSeparator() throws IOException
    {
        // Whatever is written next begins a new line.
        startOfLine = true;
        super.writeLineSeparator();
    }

    @Override
    protected void writeString(String text, List<TextPosition> textPositions) throws IOException
    {
        if (startOfLine && !textPositions.isEmpty())
        {
            // Prefix the line with the x-position of its first character.
            TextPosition firstPosition = textPositions.get(0);
            writeString(String.format("[%s]", firstPosition.getXDirAdj()));
            startOfLine = false;
        }
        super.writeString(text, textPositions);
    }
};
text = stripper.getText(document);
(ExtractText.java method extractLineStart, tested by testExtractLineStartFromSampleFile)
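Since the stripper above writes the x-position into the result String as a "[x]" prefix on each line, you may want to turn that String back into numbers afterwards. The following is only a minimal sketch of one way to do that using plain Java. LineStartParser and LineStart are hypothetical names made up for this example and are not part of PDFBox. It assumes the prefix format produced by String.format("[%s]", ...) as shown above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper (not part of PDFBox): parses lines of the form "[x]text". */
public class LineStartParser {

    /** One extracted line: x-position of its first character plus the line text. */
    public static final class LineStart {
        public final float x;
        public final String text;

        LineStart(float x, String text) {
            this.x = x;
            this.text = text;
        }
    }

    // A leading "[...]" group holding the number, followed by the rest of the line.
    private static final Pattern PREFIXED_LINE = Pattern.compile("^\\[([^\\]]+)\\](.*)$");

    public static List<LineStart> parse(String extractedText) {
        List<LineStart> result = new ArrayList<>();
        // \R matches any line break, so the stripper's line separator setting does not matter.
        for (String line : extractedText.split("\\R")) {
            Matcher m = PREFIXED_LINE.matcher(line);
            if (!m.matches()) {
                continue; // no position prefix on this line, e.g. an empty line
            }
            try {
                result.add(new LineStart(Float.parseFloat(m.group(1)), m.group(2)));
            } catch (NumberFormatException e) {
                // The bracketed prefix was ordinary text, not a position; skip the line.
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Assumed sample input, shaped like the output of the stripper above.
        String sample = "[72.0]First line\n[90.5]Indented line\n";
        for (LineStart ls : parse(sample)) {
            System.out.println(ls.x + " -> " + ls.text);
        }
    }
}
```

If you need the positions reliably rather than as decorated text, collecting them into a member of the PDFTextStripper subclass inside writeString is probably the cleaner route. Parsing the String is mainly useful when you only have the extracted text at hand.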