How to extract bold text from a PDF using PDFBox?

Tags: java, pdf, pdfbox

I am using Apache PDFBox to extract text. I can extract the text from a PDF, but I don't know how to tell whether a word is bold. (A code suggestion would be welcome.) Here is the code for extracting plain text from the PDF, which works fine.

PDDocument document = PDDocument
    .load("/home/lipu/workspace/MRCPTester/test.pdf");
// An encrypted document must be decrypted before its text can be read;
// the empty string is tried as the password.
if (document.isEncrypted()) {
    try {
        document.decrypt("");
    } catch (InvalidPasswordException e) {
        System.err.println("Error: Document is encrypted with a password.");
        System.exit(1);
    }
}

// PDFTextStripperByArea stripper = new PDFTextStripperByArea();
// stripper.setSortByPosition(true);
PDFTextStripper stripper = new PDFTextStripper();
// Restrict extraction to pages 1 and 2.
stripper.setStartPage(1);
stripper.setEndPage(2);
// Order the text by its position on the page.
stripper.setSortByPosition(true);
String st = stripper.getText(document);


asked Nov 04 '13 15:11

Lipu

1 Answer

The result of PDFTextStripper is plain text, so once the text has been extracted it is too late: the formatting information is gone. You can, however, override certain methods of the stripper and only let through text that is formatted the way you want.

For PDFTextStripper, the method to override is

protected void processTextPosition( TextPosition text )

In your override you check whether the text in question fulfills your requirements (TextPosition contains much information on the text in question, not only the text itself), and if it does, forward the TextPosition text to the super implementation.

The main problem, though, is recognizing which text is bold.

One criterion for boldness may be the word "bold" in the font name, e.g. Courier-BoldOblique. You access the font of the text using text.getFont() and the PostScript name of the font using the font's getBaseFont() method:

String postscriptName = text.getFont().getBaseFont();
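The name check described above can be sketched in plain Java. This is only an illustration: FontNameBoldness and nameSuggestsBold are hypothetical names, not part of PDFBox, and the rule "the name contains bold, ignoring case" is an assumption that also matches names such as SemiBold. The input would be the string returned by text.getFont().getBaseFont(); as far as I know, newer PDFBox versions expose the font name through getName() instead.

```java
// Hypothetical helper, not part of PDFBox: guesses boldness from a font name.
final class FontNameBoldness {

    private FontNameBoldness() {
    }

    /**
     * Returns true if the PostScript font name contains "bold" in any case.
     * A null name (font without a usable name) is treated as not bold.
     */
    static boolean nameSuggestsBold(String postscriptName) {
        if (postscriptName == null) {
            return false;
        }
        // Lower-case with Locale.ROOT so the result does not depend on the default locale.
        return postscriptName.toLowerCase(java.util.Locale.ROOT).contains("bold");
    }

    public static void main(String[] args) {
        System.out.println(nameSuggestsBold("Courier-BoldOblique")); // true
        System.out.println(nameSuggestsBold("ABCDEF+Arial-BoldMT")); // true (subset prefix is harmless)
        System.out.println(nameSuggestsBold("Helvetica"));           // false
    }
}
```

Inside the processTextPosition override, a call such as FontNameBoldness.nameSuggestsBold(postscriptName) could then decide whether the TextPosition is forwarded to the super implementation.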

Another criterion may come from the font descriptor. You get the font descriptor of a font using the getFontDescriptor method, and a font descriptor has an optional font weight value:

float fontWeight = text.getFont().getFontDescriptor().getFontWeight();

The value is defined as

(Optional; PDF 1.5; should be used for Type 3 fonts in Tagged PDF documents) The weight (thickness) component of the fully-qualified font name or font specifier. The possible values shall be 100, 200, 300, 400, 500, 600, 700, 800, or 900, where each number indicates a weight that is at least as dark as its predecessor. A value of 400 shall indicate a normal weight; 700 shall indicate bold.

The specific interpretation of these values varies from font to font.

EXAMPLE 300 in one font may appear most similar to 500 in another.

(Table 122, Section 9.8.1, ISO 32000-1)
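Given that definition, a weight check might look like the following sketch. FontWeightBoldness and weightSuggestsBold are hypothetical names, not PDFBox API. Two assumptions are built in: weights of 700 and above all count as bold, and a missing weight entry reaches this method as 0 (or some other value below 700) and therefore counts as not bold. Since the entry is optional, the caller should also check that the font descriptor itself is not null before reading the weight.

```java
// Hypothetical helper, not part of PDFBox: interprets a font descriptor weight.
final class FontWeightBoldness {

    /** The weight that ISO 32000-1 designates as bold. */
    static final float BOLD_WEIGHT = 700f;

    private FontWeightBoldness() {
    }

    /**
     * Returns true for weights of 700 or more.
     * Assumption: an absent weight is passed in as 0 and so yields false.
     */
    static boolean weightSuggestsBold(float fontWeight) {
        return fontWeight >= BOLD_WEIGHT;
    }

    public static void main(String[] args) {
        System.out.println(weightSuggestsBold(400f)); // false (normal weight)
        System.out.println(weightSuggestsBold(700f)); // true
        System.out.println(weightSuggestsBold(0f));   // false (weight not set)
    }
}
```

Because the specification notes that the interpretation varies from font to font, the 700 threshold is a starting point to tune against real documents rather than a fixed rule.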

There may be additional hints of boldness to check, e.g. a large line width

double lineWidth = getGraphicsState().getLineWidth();

when the rendering mode draws an outline, too:

int renderingMode = getGraphicsState().getTextState().getRenderingMode();
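The two values can be combined as in the sketch below. StrokeBoldness, fillsAndStrokes and strokeSuggestsBold are hypothetical names, and the minimum line width is a parameter because a suitable threshold is an assumption that depends on the documents at hand. The sketch relies on the PDF text rendering modes 2 (fill, then stroke) and 6 (fill, then stroke, and add to the clipping path), in which glyphs are filled and additionally outlined. It takes the mode as an int, as in the snippet above; if your PDFBox version returns an enum instead, it has to be mapped to the numeric mode first.

```java
// Hypothetical helper, not part of PDFBox: detects "fill plus thick outline" text.
final class StrokeBoldness {

    private StrokeBoldness() {
    }

    /** True for PDF text rendering modes that fill the glyphs and also stroke their outline. */
    static boolean fillsAndStrokes(int renderingMode) {
        return renderingMode == 2 || renderingMode == 6;
    }

    /**
     * Returns true if the text is filled and stroked with a line at least
     * minLineWidth wide. minLineWidth is a caller-chosen threshold (an assumption).
     */
    static boolean strokeSuggestsBold(int renderingMode, double lineWidth, double minLineWidth) {
        return fillsAndStrokes(renderingMode) && lineWidth >= minLineWidth;
    }

    public static void main(String[] args) {
        System.out.println(strokeSuggestsBold(2, 1.0, 0.5)); // true
        System.out.println(strokeSuggestsBold(0, 1.0, 0.5)); // false (fill only)
        System.out.println(strokeSuggestsBold(2, 0.1, 0.5)); // false (outline too thin)
    }
}
```

Mode 1 (stroke only) is left out on purpose, since it produces hollow outlined text rather than emboldened text.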

You may have to experiment with the documents you have at hand to find out which criteria suffice.

answered Oct 21 '22 04:10

mkl

