
PDFBox PDFTextStripperByArea region coordinates

Tags:

pdfbox

In what units and in what direction is the Rectangle interpreted in PDFTextStripperByArea's method addRegion(String regionName, Rectangle2D rect)?

In other words, where does the rectangle R start and how big is it (dimensions of the origin values, dimensions of the rectangle) and in what direction does it go (direction of the blue arrows in illustration), if new Rectangle(10,10,100,100) is given as a second parameter?

[Illustration: PdfBox rectangle]

ipavlic asked Dec 15 '11 07:12


3 Answers

new Rectangle(10,10,100,100)

means that the rectangle has its upper-left corner at position (10, 10), that is, 10 units from the left edge and 10 units from the top edge of the PDF page. Here a "unit" is 1 pt = 1/72 inch.

The first 100 is the width of the rectangle and the second is its height. To sum up, the correct picture is the first one.
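Since the unit is 1/72 inch, a region measured in inches can be converted before it is handed to addRegion. This is only a minimal sketch using java.awt.geom from the JDK; the class RegionUnits and the helper regionFromInches are names made up for illustration and are not part of PDFBox.

```java
import java.awt.geom.Rectangle2D;

class RegionUnits {
    // 1 pt = 1/72 inch, as stated above
    static final double POINTS_PER_INCH = 72.0;

    // Hypothetical helper (not a PDFBox API): builds a region from measurements
    // in inches, taken from the left and top edges of the page.
    static Rectangle2D.Double regionFromInches(double left, double top,
                                               double width, double height) {
        return new Rectangle2D.Double(
                left * POINTS_PER_INCH,
                top * POINTS_PER_INCH,
                width * POINTS_PER_INCH,
                height * POINTS_PER_INCH);
    }

    public static void main(String[] args) {
        // 1 inch from the left, half an inch from the top, 2 x 1 inches in size
        Rectangle2D.Double r = regionFromInches(1.0, 0.5, 2.0, 1.0);
        // prints "72.0 36.0 144.0 72.0"
        System.out.println(r.getX() + " " + r.getY() + " "
                + r.getWidth() + " " + r.getHeight());
    }
}
```

The resulting Rectangle2D.Double could then be passed as the second argument of addRegion.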

I wrote this code to extract some areas of a page given as arguments to the function:

Rectangle2D region = new Rectangle2D.Double(x, y, width, height);
String regionName = "region";
PDFTextStripperByArea stripper;

stripper = new PDFTextStripperByArea();
stripper.addRegion(regionName, region);
stripper.extractRegions(page);

So x and y are the absolute coordinates of the upper-left corner of the rectangle, followed by its width and height. page is a PDPage passed as an argument to this function.
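One thing that can trip people up: PDF's own default coordinate space has its origin at the lower-left corner with y growing upward, so coordinates read from other tools may need flipping before they fit the upper-left convention described here. A rough sketch of that conversion follows. It assumes the page height in points is known (792 for US Letter in the example) and ignores page rotation and crop box offsets; RegionFlip and toTopLeftOrigin are hypothetical names, not PDFBox calls.

```java
import java.awt.geom.Rectangle2D;

class RegionFlip {
    // Hypothetical helper (not a PDFBox API): takes a rectangle described with a
    // lower-left origin (yBottom is the distance of its lower edge from the bottom
    // of the page) and returns the same area with an upper-left origin.
    static Rectangle2D.Double toTopLeftOrigin(double x, double yBottom,
                                              double width, double height,
                                              double pageHeight) {
        // The upper edge sits at yBottom + height from the bottom of the page,
        // so its distance from the top is the page height minus that value.
        double yTop = pageHeight - (yBottom + height);
        return new Rectangle2D.Double(x, yTop, width, height);
    }

    public static void main(String[] args) {
        // Assumed US Letter page, 792 pt high; lower edge 700 pt above the bottom
        Rectangle2D.Double r = toTopLeftOrigin(72, 700, 200, 50, 792);
        // prints "72.0 42.0 200.0 50.0"
        System.out.println(r.getX() + " " + r.getY() + " "
                + r.getWidth() + " " + r.getHeight());
    }
}
```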

Nicolas W. answered Nov 11 '22 18:11


Was looking into doing something like this, so I thought I'd pass what I found along.

Here's the code for creating my original PDF using iText.

import com.lowagie.text.Document
import com.lowagie.text.Paragraph
import com.lowagie.text.pdf.PdfWriter

class SimplePdfCreator {
    void createFrom(String path) {
        Document d = new Document()
        try {
            PdfWriter writer = PdfWriter.getInstance(d, new FileOutputStream(path))
            d.open()
            d.add(new Paragraph("This is a test."))
            d.close()
        } catch (Exception e) {
            e.printStackTrace()
        }
    }
}

If you crack open the pdf, you'll see the text in the upper left hand corner. Here's the test showing what you are looking for.

@Test
void createFrom_using_pdf_box_to_extract_text_targeted_extraction() {
    new SimplePdfCreator().createFrom("myFileLocation")
    def doc = PDDocument.load("myFileLocation")
    Rectangle2D.Double d = new Rectangle2D.Double(0, 0, 120, 100)
    def stripper = new PDFTextStripperByArea()
    def pages = doc.getDocumentCatalog().allPages
    stripper.addRegion("myRegion", d)
    stripper.extractRegions(pages[0])
    assert stripper.getTextForRegion("myRegion").contains("This is a test.")
}

Position (0, 0) is the upper-left corner of the page. The width extends to the right and the height extends downward. I was able to trim the region down to (35, 52, 120, 3) and still get the test to pass.

All code is written in groovy.
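A possible explanation for why a region only 3 pt high still matches: as far as I can tell from the PDFBox source, PDFTextStripperByArea assigns each glyph to a region by testing a single point per glyph with Rectangle2D.contains, not by intersecting bounding boxes. The plain Java sketch below only demonstrates the contains semantics on the trimmed region; the glyph positions (36, 54) and (36, 60) are made-up values for illustration, not measurements from the test PDF.

```java
import java.awt.geom.Rectangle2D;

class RegionContains {
    // Returns true when the point (x, y) lies inside the region.
    // Rectangle2D.contains includes the left/top edges and excludes the right/bottom edges.
    static boolean inRegion(Rectangle2D region, double x, double y) {
        return region.contains(x, y);
    }

    public static void main(String[] args) {
        // The trimmed region from the test above: x=35, y=52, width=120, height=3
        Rectangle2D region = new Rectangle2D.Double(35, 52, 120, 3);

        // Hypothetical glyph position inside the 3 pt band (52 <= y < 55)
        System.out.println(inRegion(region, 36, 54)); // prints "true"

        // Hypothetical glyph position a few points lower, outside the band
        System.out.println(inRegion(region, 36, 60)); // prints "false"
    }
}
```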

benkiefer answered Nov 11 '22 19:11


Code in Java using PDFBox (2.x API).

public String fetchTextByRegion(String path, String filename, int pageNumber) throws IOException {
    File file = new File(path + filename);
    // try-with-resources closes the document even if extraction fails
    try (PDDocument document = PDDocument.load(file)) {
        // Rectangle2D.Double(x, y, width, height), in points, measured from the upper-left corner
        Rectangle2D region = new Rectangle2D.Double(0, 100, 550, 700);
        String regionName = "region";
        // getPage is zero-based; pageNumber is assumed here to be 1-based
        PDPage page = document.getPage(pageNumber - 1);
        PDFTextStripperByArea stripper = new PDFTextStripperByArea();
        stripper.addRegion(regionName, region);
        stripper.extractRegions(page);
        return stripper.getTextForRegion(regionName);
    }
}
Vivek joshi answered Nov 11 '22 20:11