
How to extract bold text from pdf using pdfbox?





I am using Apache PDFBox to extract text. I can extract the text from a PDF, but I don't know how to tell whether a word is bold. (A code suggestion would be good.) Here is the code for extracting plain text from a PDF, which works fine:

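// pdfPath is a placeholder: the actual file path is not shown in the original post.
PDDocument document = PDDocument.load(pdfPath);
if (document.isEncrypted()) {
    try {
        // Presumably an attempt to decrypt with an empty password (PDFBox 1.x API).
        document.decrypt("");
    } catch (InvalidPasswordException e) {
        System.err.println("Error: Document is encrypted with a password.");
    }
}

// PDFTextStripperByArea stripper = new PDFTextStripperByArea();
// stripper.setSortByPosition(true);
PDFTextStripper stripper = new PDFTextStripper();
String st = stripper.getText(document);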
Lipu asked Nov 04 '13 15:11


1 Answer

The result of PDFTextStripper is plain text, so once the text has been extracted it is too late to ask how it was formatted. You can, however, override certain methods of the stripper and let through only text that is formatted the way you want.

In the case of PDFTextStripper, the method you have to override is

protected void processTextPosition( TextPosition text )

In your override you check whether the text in question fulfills your requirements (TextPosition contains much information on the text in question, not only the text itself), and if it does, forward the TextPosition text to the super implementation.
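The check-then-forward pattern can be sketched without PDFBox on the classpath. In this minimal sketch, Glyph, PlainStripper and FilteringStripper are hypothetical stand-ins (for TextPosition, PDFTextStripper and your own subclass respectively), not PDFBox classes; in real code you would extend PDFTextStripper and call super.processTextPosition(text) in the same way. The boldness test is deliberately left as a pluggable predicate.

```java
import java.util.function.Predicate;

// Hypothetical stand-in for PDFBox's TextPosition: only the text and a font name.
final class Glyph {
    final String text;
    final String fontName;

    Glyph(String text, String fontName) {
        this.text = text;
        this.fontName = fontName;
    }
}

// Hypothetical stand-in for PDFTextStripper: collects every glyph it is handed.
class PlainStripper {
    private final StringBuilder out = new StringBuilder();

    protected void processTextPosition(Glyph glyph) {
        out.append(glyph.text);
    }

    String result() {
        return out.toString();
    }
}

// Forwards a glyph to the super implementation only when the predicate accepts it.
class FilteringStripper extends PlainStripper {
    private final Predicate<Glyph> accept;

    FilteringStripper(Predicate<Glyph> accept) {
        this.accept = accept;
    }

    @Override
    protected void processTextPosition(Glyph glyph) {
        if (accept.test(glyph)) {
            super.processTextPosition(glyph);
        }
        // Rejected glyphs are dropped, so they never reach the extracted text.
    }
}
```

Anything the predicate rejects never reaches the collected output, which is why the filtering has to happen here rather than on the finished string.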

The main problem, though, is recognizing which text is bold.

One criterion for boldness may be the word "bold" in the font name, e.g. Courier-BoldOblique. You access the font of the text using text.getFont() and the PostScript name of the font using the font's getBaseFont() method:

String postscriptName = text.getFont().getBaseFont();
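A name check along these lines might look as follows. This is only a sketch: FontNameHeuristic and looksBold are hypothetical names, and the code takes the PostScript name as a plain string so it has no PDFBox dependency. It strips the subset tag that embedded subset fonts carry (six uppercase letters and a plus sign, as in ABCDEF+Arial-BoldMT), because the tag itself could happen to contain the letters BOLD.

```java
import java.util.Locale;

// Hypothetical helper: guesses boldness from a PostScript font name.
final class FontNameHeuristic {
    private FontNameHeuristic() {
    }

    static boolean looksBold(String postscriptName) {
        if (postscriptName == null) {
            return false; // assumption: a font without a name gives no hint
        }
        // Drop a subset tag such as "ABCDEF+" before looking at the name.
        String name = postscriptName.replaceFirst("^[A-Z]{6}\\+", "");
        // Case-insensitive search for "bold" anywhere in the remaining name.
        return name.toLowerCase(Locale.ROOT).contains("bold");
    }
}
```

Note that this also accepts names such as Semibold or DemiBold; whether that is what you want depends on your documents.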

Another criterion may come from the font descriptor. You get the font descriptor of a font using the getFontDescriptor method, and a font descriptor has an optional font weight value:

float fontWeight = text.getFont().getFontDescriptor().getFontWeight();

The value is defined as:

(Optional; PDF 1.5; should be used for Type 3 fonts in Tagged PDF documents) The weight (thickness) component of the fully-qualified font name or font specifier. The possible values shall be 100, 200, 300, 400, 500, 600, 700, 800, or 900, where each number indicates a weight that is at least as dark as its predecessor. A value of 400 shall indicate a normal weight; 700 shall indicate bold.

The specific interpretation of these values varies from font to font.

EXAMPLE 300 in one font may appear most similar to 500 in another.

(Table 122, Section 9.8.1, ISO 32000-1)
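Going by that definition, a weight test could be sketched like this. FontWeightHeuristic is a hypothetical helper, and the sketch assumes the caller passes 0 when the font has no descriptor or no weight entry; since the entry is optional, a guard against a missing descriptor is probably worth having before calling getFontWeight().

```java
// Hypothetical helper: interprets the FontWeight value of a font descriptor.
final class FontWeightHeuristic {
    // "700 shall indicate bold" (ISO 32000-1, Table 122).
    static final float BOLD_WEIGHT = 700f;

    private FontWeightHeuristic() {
    }

    // Assumption: the caller passes 0 when the weight is absent,
    // so a missing entry is never reported as bold.
    static boolean looksBold(float fontWeight) {
        return fontWeight >= BOLD_WEIGHT;
    }
}
```

Using >= rather than == treats 800 and 900 as bold as well, which matches the rule that each value is at least as dark as its predecessor. Because the interpretation varies from font to font, the threshold may need adjusting.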

There may be additional hints of boldness to check, e.g. a large line width

double lineWidth = getGraphicsState().getLineWidth();

when the rendering mode draws an outline, too:

int renderingMode = getGraphicsState().getTextState().getRenderingMode();
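Combining the two values might look like the sketch below. StrokeHeuristic and the minLineWidth threshold are hypothetical; the sketch reads "draws an outline, too" as the fill-then-stroke text rendering modes 2 and 6 (ISO 32000-1, Section 9.3.6), which is an interpretation rather than something the specification calls bold.

```java
// Hypothetical helper: detects "poor man's bold", i.e. glyphs that are
// filled and additionally stroked with a noticeable line width.
final class StrokeHeuristic {
    private StrokeHeuristic() {
    }

    // Text rendering modes that both fill and stroke:
    // 2 = fill, then stroke; 6 = fill, then stroke, and add to clipping path.
    static boolean fillsAndStrokes(int renderingMode) {
        return renderingMode == 2 || renderingMode == 6;
    }

    // minLineWidth is an assumed threshold that has to be tuned per document.
    static boolean looksBold(int renderingMode, double lineWidth, double minLineWidth) {
        return fillsAndStrokes(renderingMode) && lineWidth >= minLineWidth;
    }
}
```

The line width only means something relative to the font size and the current transformation, so a fixed threshold is at best a starting point.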

You may have to experiment with the documents you have at hand to find out which criteria suffice.

mkl answered Oct 21 '22 04:10
