I am using Apache PDFBox to extract text. I can extract the text from a PDF, but I don't know how to tell whether a word is bold. (A code suggestion would be welcome.) Here is the code for extracting plain text from the PDF, which works fine:
PDDocument document = PDDocument.load("/home/lipu/workspace/MRCPTester/test.pdf");
// An encrypted document has to be decrypted before its text can be extracted.
if (document.isEncrypted()) {
    try {
        document.decrypt("");
    } catch (InvalidPasswordException e) {
        System.err.println("Error: Document is encrypted with a password.");
        System.exit(1);
    }
}
// PDFTextStripperByArea stripper = new PDFTextStripperByArea();
// stripper.setSortByPosition(true);
PDFTextStripper stripper = new PDFTextStripper();
// Only pages 1 and 2 are extracted.
stripper.setStartPage(1);
stripper.setEndPage(2);
stripper.setSortByPosition(true);
String st = stripper.getText(document);
The result of PDFTextStripper is plain text, so once you have extracted it, it is too late to ask about formatting. But you can override certain methods of the stripper and only let through text which is formatted according to your wishes.
In the case of PDFTextStripper, you have to override
protected void processTextPosition( TextPosition text )
In your override you check whether the text in question fulfills your requirements (TextPosition contains much information on the text in question, not only the text itself), and if it does, forward the TextPosition text to the super implementation.
The main problem is, though, to recognize which text is bold.
Criteria for boldness may be the word bold in the font name, e.g. Courier-BoldOblique. You access the font of the text using text.getFont() and the PostScript name of the font using the font's getBaseFont() method:
String postscriptName = text.getFont().getBaseFont();
Criteria may also come from the font descriptor. You get the font descriptor of a font using the getFontDescriptor method, and a font descriptor has an optional font weight value:
float fontWeight = text.getFont().getFontDescriptor().getFontWeight();
The value is defined as
(Optional; PDF 1.5; should be used for Type 3 fonts in Tagged PDF documents) The weight (thickness) component of the fully-qualified font name or font specifier. The possible values shall be 100, 200, 300, 400, 500, 600, 700, 800, or 900, where each number indicates a weight that is at least as dark as its predecessor. A value of 400 shall indicate a normal weight; 700 shall indicate bold.
The specific interpretation of these values varies from font to font.
EXAMPLE 300 in one font may appear most similar to 500 in another.
(Table 122, Section 9.8.1, ISO 32000-1)
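To make the two font-based criteria concrete, here is a minimal sketch of a check you could call from your processTextPosition override. The class and method names (BoldHeuristics, looksBold) are made up for illustration and are not part of PDFBox. It assumes you pass in the values obtained via getBaseFont() and getFontWeight() as shown above, and it treats a weight of 0 or less as "not specified", which is an assumption about how a missing entry is reported. Also check that the font descriptor is not null before asking it for the weight.

```java
import java.util.Locale;

// Hypothetical helper, not part of PDFBox.
final class BoldHeuristics {

    private BoldHeuristics() {
    }

    /**
     * Guesses whether a font is bold.
     *
     * @param postscriptName value of text.getFont().getBaseFont(); may be null
     * @param fontWeight     value of the descriptor's getFontWeight();
     *                       assumed to be 0 or less when the entry is missing
     */
    static boolean looksBold(String postscriptName, float fontWeight) {
        // ISO 32000-1: 400 indicates normal weight, 700 indicates bold.
        if (fontWeight >= 700f) {
            return true;
        }
        if (postscriptName == null) {
            return false;
        }
        // Case-insensitive match, so that "Courier-BoldOblique" as well as
        // subset names like "ABCDEF+Arial-BoldMT" are caught.
        return postscriptName.toLowerCase(Locale.ROOT).contains("bold");
    }
}
```

In the override you would forward the TextPosition to the super implementation only when looksBold returns true. Since the interpretation of the weight varies from font to font, treat the 700 threshold as a starting point.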
There may be additional hints towards bold-ism to check, e.g. a big line width
double lineWidth = getGraphicsState().getLineWidth();
when the rendering mode draws an outline, too:
int renderingMode = getGraphicsState().getTextState().getRenderingMode();
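A rough sketch of how these two values could be combined follows. The names (FauxBoldHeuristics, looksFauxBold, minLineWidth) are hypothetical and not part of PDFBox. The rendering mode numbers are the ones from the PDF specification, where modes 2 and 6 fill the glyphs and stroke their outline as well. The threshold for what counts as a "big" line width is something you have to choose for your own documents.

```java
// Hypothetical helper, not part of PDFBox.
final class FauxBoldHeuristics {

    // Text rendering modes from ISO 32000-1 that fill AND stroke the glyphs.
    private static final int FILL_THEN_STROKE = 2;
    private static final int FILL_THEN_STROKE_AND_CLIP = 6;

    private FauxBoldHeuristics() {
    }

    /**
     * Guesses whether text is made to look bold by stroking its outline
     * in addition to filling it.
     *
     * @param renderingMode value of getTextState().getRenderingMode()
     * @param lineWidth     value of getGraphicsState().getLineWidth()
     * @param minLineWidth  caller-chosen threshold (an assumption, tune it
     *                      against your own documents)
     */
    static boolean looksFauxBold(int renderingMode, double lineWidth, double minLineWidth) {
        boolean fillsAndStrokes = renderingMode == FILL_THEN_STROKE
                || renderingMode == FILL_THEN_STROKE_AND_CLIP;
        return fillsAndStrokes && lineWidth >= minLineWidth;
    }
}
```

Modes that only stroke (1 and 5) are deliberately left out here, because outlined text without fill usually looks hollow, not bold.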
You may have to try out, with the documents you have at hand, which criteria suffice.