
Get font of each line using PDFBox

Tags: pdf, fonts, pdfbox

Is there a way to get the font of each line of a PDF file using PDFBox? I have tried the following, but it only lists all the fonts used on each page. It does not show which line or text is shown in which font.

List<PDPage> pages = doc.getDocumentCatalog().getAllPages();
for (PDPage page : pages)
{
    Map<String, PDFont> pageFonts = page.getResources().getFonts();
    for (String key : pageFonts.keySet())
    {
        System.out.println(key + " - " + pageFonts.get(key));
        System.out.println(pageFonts.get(key).getBaseFont());
    }
}

Any input is appreciated. Thanks!

user3023239 asked Nov 30 '22 02:11


2 Answers

Whenever you extract text (plain or with styling information) from a PDF using PDFBox, you should generally start with the PDFTextStripper class or one of its relatives. This class already does the heavy lifting of PDF content parsing for you.

You use the plain PDFTextStripper class like this:

PDDocument document = ...;
PDFTextStripper stripper = new PDFTextStripper();
// set stripper start and end page or bookmark attributes unless you want all the text
String text = stripper.getText(document);

This returns only the plain text, e.g. for an R40 form:

Claim for repayment of tax deducted 
from savings and investments
How to fill in this form
Please fill in this form with details of your income for the
above tax year. The enclosed Notes will help you (but there is
not a note for every box on the form). If you need more help
with anything on this form, please phone us on the number
shown above.
If you are not a UK resident, do not use this form – please 
contact us.
Please do not send us any personal records, or tax
certificates or vouchers with your form. We will contact 
you if we need these.
Please allow four weeks before contacting us about your
repayment. We will pay you as quickly as possible.
Use black ink and capital letters
Cross out any mistakes and write the
correct information below
...

You can, however, override its method writeString(String, List<TextPosition>) and process more information than just the text. To insert the name of the font wherever the font changes, you can use this:

PDFTextStripper stripper = new PDFTextStripper() {
    // base font name of the most recently processed character
    String prevBaseFont = "";

    @Override
    protected void writeString(String text, List<TextPosition> textPositions) throws IOException
    {
        StringBuilder builder = new StringBuilder();

        for (TextPosition position : textPositions)
        {
            String baseFont = position.getFont().getBaseFont();
            // emit a [font name] marker whenever the font changes
            if (baseFont != null && !baseFont.equals(prevBaseFont))
            {
                builder.append('[').append(baseFont).append(']');
                prevBaseFont = baseFont;
            }
            builder.append(position.getCharacter());
        }

        // hand the annotated text to the stripper's normal output
        writeString(builder.toString());
    }
};

For the same form, you get:

[DHSLTQ+IRModena-Bold]Claim for repayment of tax deducted 
from savings and investments
How to fill in this form
[OIALXD+IRModena-Regular]Please fill in this form with details of your income for the
above tax year. The enclosed Notes will help you (but there is
not a note for every box on the form). If you need more help
with anything on this form, please phone us on the number
shown above.
If you are not a UK resident, do not use this form – please 
contact us.
[DHSLTQ+IRModena-Bold]Please do not send us any personal records, or tax
certificates or vouchers with your form. We will contact 
you if we need these.
[OIALXD+IRModena-Regular]Please allow four weeks before contacting us about your
repayment. We will pay you as quickly as possible.
Use black ink and capital letters
Cross out any mistakes and write the
correct information below
...
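
The six-letter prefixes in the font names above (DHSLTQ+, OIALXD+) are subset tags: PDF marks an embedded font subset by prefixing the font name with six uppercase letters and a plus sign. If you only care about the underlying font name, a minimal sketch like the following strips the tag. The class and method names (SubsetTags, stripSubsetTag) are made up for this illustration and are not part of PDFBox.

```java
import java.util.regex.Pattern;

class SubsetTags {
    // Six uppercase letters followed by '+' at the start of the name:
    // the tag a PDF uses to mark an embedded font subset.
    private static final Pattern SUBSET_TAG = Pattern.compile("^[A-Z]{6}\\+");

    // Hypothetical helper: returns the name without its subset tag, if it has one.
    static String stripSubsetTag(String baseFont) {
        if (baseFont == null) {
            return null;
        }
        return SUBSET_TAG.matcher(baseFont).replaceFirst("");
    }

    public static void main(String[] args) {
        System.out.println(stripSubsetTag("DHSLTQ+IRModena-Bold"));    // IRModena-Bold
        System.out.println(stripSubsetTag("OIALXD+IRModena-Regular")); // IRModena-Regular
    }
}
```

Names without a tag pass through unchanged, so the helper can be applied to every font name the stripper reports.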

If you don't want the font information to be merged with the text, simply create separate structures in your method override.
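
One possible shape for such a separate structure is a list of font runs, sketched here without any PDFBox dependency so it can be run on its own. Glyph is a hypothetical stand-in for the two values read from each TextPosition (font name and character), and FontRuns, Run and group are likewise names invented for this example. In a real stripper you would fill the glyph list inside writeString instead of building it by hand.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class FontRuns {
    /** Hypothetical stand-in for the two TextPosition values used here. */
    static final class Glyph {
        final String fontName;
        final String text;
        Glyph(String fontName, String text) { this.fontName = fontName; this.text = text; }
    }

    /** A stretch of consecutive glyphs that share one font. */
    static final class Run {
        final String fontName;
        final String text;
        Run(String fontName, String text) { this.fontName = fontName; this.text = text; }
        @Override public String toString() { return "[" + fontName + "] " + text; }
    }

    /** Groups glyphs into runs, starting a new run whenever the font name changes. */
    static List<Run> group(List<Glyph> glyphs) {
        List<Run> runs = new ArrayList<>();
        String currentFont = null;
        StringBuilder currentText = new StringBuilder();
        for (Glyph glyph : glyphs) {
            // As in the stripper above, a null font name never starts a new run.
            if (glyph.fontName != null && !glyph.fontName.equals(currentFont)) {
                if (currentText.length() > 0) {
                    runs.add(new Run(currentFont, currentText.toString()));
                    currentText.setLength(0);
                }
                currentFont = glyph.fontName;
            }
            currentText.append(glyph.text);
        }
        // Flush the last run, if any text was collected.
        if (currentText.length() > 0) {
            runs.add(new Run(currentFont, currentText.toString()));
        }
        return runs;
    }

    public static void main(String[] args) {
        List<Glyph> glyphs = Arrays.asList(
            new Glyph("IRModena-Bold", "H"),
            new Glyph("IRModena-Bold", "i"),
            new Glyph("IRModena-Regular", "!"));
        for (Run run : group(glyphs)) {
            System.out.println(run);
        }
    }
}
```

Keeping the runs in a list rather than in the output string leaves the extracted text untouched and lets you query the fonts afterwards, for example to find the font that covers most of a line.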

TextPosition offers a lot more information on the piece of text it represents. Inspect it!

mkl answered Dec 04 '22 10:12


To add to mkl's answer, if you are using PDFBox 2.0.8:

  • Use position.getFont().getName() instead of position.getFont().getBaseFont()
  • Use position.getUnicode() instead of position.getCharacter()

More information on PDFont and TextPosition can be found in their online Javadocs.

Todd answered Dec 04 '22 09:12