PDFBox extracting paragraphs

Tags:

pdfbox

I am new to PDFBox and want to extract a paragraph that contains particular words. I can already extract the whole PDF to a text file (opened in Notepad), but I have no idea how to extract just one particular paragraph in my Java program. Can anyone help me with this, or at least point me to some tutorials or examples? Thank you so much.

684

asked Feb 26 '12 07:02

scc

2 Answers

Text in PDF documents is absolutely positioned. So instead of words, lines and paragraphs, one only has absolutely positioned characters.

Let's say you have a paragraph:

Neque porro quisquam est qui dolorem ipsum quia dolor sit amet, consectetur, adipisci velit

Roughly speaking, in the PDF file it will be represented as the character N at some position, e slightly to the right of it, then q, u, e further to the right, and so on.

PDFBox tries to guess how the characters form words, lines and paragraphs. It looks for runs of characters at approximately the same vertical position, and for groups of characters that are close to each other and similar in appearance. It does that by extracting the text from the entire page and then processing it character by character to build the text (it can also extract text from just one rectangular area inside a page). See the class PDFTextStripper (or PDFTextStripperByArea). For usage, see ExtractText.java in the PDFBox sources.
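To make the idea of guessing lines from positions concrete, here is a toy sketch. It is not PDFBox's actual algorithm, and the Glyph class, the toLines method and both thresholds are hypothetical names and values invented for this illustration. It only shows the principle: characters whose vertical positions are close together are treated as one line, and a wide horizontal gap is treated as a space.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

public class GlyphGrouping {

    /** A character with an absolute page position (hypothetical type, not a PDFBox class). */
    public static final class Glyph {
        final char ch;
        final double x;
        final double y;

        public Glyph(char ch, double x, double y) {
            this.ch = ch;
            this.x = x;
            this.y = y;
        }
    }

    /**
     * Groups glyphs into lines. A glyph joins the current line if its y differs from the
     * y of that line's first glyph by at most yTolerance. Within a line, a space is
     * inserted when the x distance to the previous glyph exceeds spaceGap.
     */
    public static List<String> toLines(List<Glyph> glyphs, double yTolerance, double spaceGap) {
        List<Glyph> sorted = new ArrayList<>(glyphs);
        // Order by vertical position first (y is assumed to grow down the page here).
        sorted.sort(Comparator.comparingDouble((Glyph g) -> g.y));

        // Bucket glyphs with similar y values into the same line.
        List<List<Glyph>> buckets = new ArrayList<>();
        for (Glyph g : sorted) {
            List<Glyph> last = buckets.isEmpty() ? null : buckets.get(buckets.size() - 1);
            if (last == null || Math.abs(g.y - last.get(0).y) > yTolerance) {
                last = new ArrayList<>();
                buckets.add(last);
            }
            last.add(g);
        }

        // Within each line, read left to right and guess where the word breaks are.
        List<String> lines = new ArrayList<>();
        for (List<Glyph> bucket : buckets) {
            bucket.sort(Comparator.comparingDouble((Glyph g) -> g.x));
            StringBuilder sb = new StringBuilder();
            Glyph prev = null;
            for (Glyph g : bucket) {
                if (prev != null && g.x - prev.x > spaceGap) {
                    sb.append(' ');
                }
                sb.append(g.ch);
                prev = g;
            }
            lines.add(sb.toString());
        }
        return lines;
    }

    public static void main(String[] args) {
        // Glyphs are deliberately listed out of reading order: only positions matter.
        List<Glyph> glyphs = Arrays.asList(
                new Glyph('b', 6, 10.2),
                new Glyph('a', 0, 10.0),
                new Glyph('c', 20, 9.9),
                new Glyph('d', 0, 30.0));
        for (String line : toLines(glyphs, 1.0, 8.0)) {
            System.out.println(line);
        }
    }
}
```

Both thresholds are guesses, and this is exactly where real documents cause trouble: a tolerance that suits one font size or layout will merge or split lines in another.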

That means that you cannot extract paragraphs easily using PDFBox. It also means that PDFBox can, and sometimes will, get things wrong when extracting text (there are a lot of very different PDF documents out there).

What you can do is extract text from the entire page and then try and find your paragraph searching through that text. Regular expressions are usually well suited for such tasks (available in Java either through Pattern and Matcher classes, or convenience methods on String class).
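A minimal sketch of that search step, working on text that has already been extracted (for example the String returned by PDFTextStripper.getText). The class and method names are made up for this example, and it assumes paragraphs are separated by blank lines in the extracted text, which PDFBox does not guarantee for every document.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

public class ParagraphSearch {

    /**
     * Returns the paragraphs of the given text that contain the keyword or phrase
     * as whole words, ignoring case. Paragraphs are assumed to be separated by
     * one or more blank lines.
     */
    public static List<String> findParagraphs(String text, String keyword) {
        // A line break followed by one or more empty (or whitespace-only) lines.
        String[] paragraphs = text.split("\\R(?:[ \\t]*\\R)+");

        // Quote the keyword so that characters such as '.' or '(' are matched literally.
        Pattern wanted = Pattern.compile(
                "\\b" + Pattern.quote(keyword) + "\\b", Pattern.CASE_INSENSITIVE);

        List<String> matches = new ArrayList<>();
        for (String paragraph : paragraphs) {
            // Collapse line breaks inside the paragraph so phrases spanning two lines still match.
            String normalized = paragraph.replaceAll("\\s+", " ").trim();
            if (!normalized.isEmpty() && wanted.matcher(normalized).find()) {
                matches.add(normalized);
            }
        }
        return matches;
    }

    public static void main(String[] args) {
        // Stand-in for text extracted from a PDF.
        String text = "First paragraph about cats.\nStill cats.\n\n"
                + "Second paragraph mentions dolor sit\namet here.\n \n"
                + "Third one.";
        for (String paragraph : findParagraphs(text, "dolor sit amet")) {
            System.out.println(paragraph);
        }
    }
}
```

If the extracted text has no blank lines between paragraphs, the split pattern has to be adapted to whatever separator the document actually produces.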

142

answered Sep 20 '22 08:09

ipavlic

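import java.io.File;
import java.io.IOException;
import java.util.regex.Pattern;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;

public class ParagraphExtractor {
    public static void main(String[] args) throws IOException {
        File file = new File("File Path");
        // try-with-resources closes the document once extraction is finished (PDFBox 2.x API)
        try (PDDocument document = PDDocument.load(file)) {
            PDFTextStripper pdfStripper = new PDFTextStripper();
            // "/t" is a literal marker string (not a tab escape) that PDFBox writes
            // wherever it detects the start of a paragraph
            pdfStripper.setParagraphStart("/t");
            pdfStripper.setSortByPosition(true);

            // Split the extracted text at each marker; quote it because split() expects a regex
            String marker = Pattern.quote(pdfStripper.getParagraphStart());
            for (String paragraph : pdfStripper.getText(document).split(marker)) {
                System.out.println(paragraph);
                System.out.println("********************************************************************");
            }
        }
    }
}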

Please try the code above. It works for me with the PDFBox 2.0.8 jar.

answered Sep 21 '22 08:09

aavos
