 

PDFBox: extract image location (wrong x and y)

Tags: java, pdfbox

Hello again fellow programmers.

I can extract PDF text coordinates and formatting properly, but I can't do the same for images. I get the correct width and height, but the x and y are wrong.

I'm using Photoshop to check whether I'm getting the proper x, y, width, and height, but only the width and height are correct.

Here is my code:

@Override
public void processOperator(Operator operator, List<COSBase> arguments) throws IOException {
    if ("cm".equals(operator.getName())) {
        float width = ((COSNumber)arguments.get(0)).floatValue();
        float height = ((COSNumber)arguments.get(3)).floatValue();
        float x = ((COSNumber)arguments.get(4)).floatValue();
        float y = ((COSNumber)arguments.get(5)).floatValue();
        System.out.println("w: " + width + " h: " + height + " x: " + x + " y: " + y);
        // process image coordinates
    }

    super.processOperator(operator, arguments);
}

And here is the example PDF I used:

http://persci.mit.edu/pub_pdfs/personal_photo_enhancement.pdf

and I'm using page 2.

This is the output of the program:

w: 503.87997 h: 152.64 x: 71.5168 y: 561.056

I created a rectangle in Photoshop and overlaid it on the image, but only the width and height match.


Another problem

I used this PDF

http://www.ctex.org/documents/shredder/src/example.pdf

I used page 17.

Why does the program print so many coordinates when the page contains only one image?

w: 1.0 h: 1.0 x: 124.802 y: 776.998
w: 1.0 h: 1.0 x: 0.0 y: 3.587
w: 1.0 h: 1.0 x: 0.0 y: -3.985
w: 1.0 h: 1.0 x: 343.711 y: 0.398
w: 1.0 h: 1.0 x: -343.711 y: -24.906
w: 1.0 h: 1.0 x: 147.972 y: -106.0
w: 1.0 h: 1.0 x: 0.0 y: 0.0
w: 1.0 h: 1.0 x: 0.0 y: 0.0
w: 0.1 h: 0.1 x: 0.0 y: 0.0
w: 1.0 h: 1.0 x: 45.0 y: 0.0
w: 1.0 h: 1.0 x: -79.37 y: -21.918
w: 1.0 h: 1.0 x: 116.507 y: 0.0
w: 1.0 h: 1.0 x: -230.109 y: -2.145
w: 1.0 h: 1.0 x: 0.0 y: -20.324
w: 1.0 h: 1.0 x: 0.0 y: -13.682
w: 1.0 h: 1.0 x: 3.387 y: 2.989
w: 1.0 h: 1.0 x: 20.175 y: -2.989
w: 1.0 h: 1.0 x: -23.562 y: -0.398
w: 1.0 h: 1.0 x: 30.685 y: 3.387
w: 1.0 h: 1.0 x: 179.886 y: -66.21
w: 1.0 h: 1.0 x: 4.981 y: 0.0
w: 1.0 h: 1.0 x: -215.552 y: -17.195
w: 1.0 h: 1.0 x: 0.0 y: -13.682
w: 1.0 h: 1.0 x: 3.387 y: 2.989
w: 1.0 h: 1.0 x: 20.175 y: -2.989
w: 1.0 h: 1.0 x: -23.562 y: -0.398
w: 1.0 h: 1.0 x: 30.685 y: 3.387
w: 1.0 h: 1.0 x: -35.666 y: -76.173
w: 1.0 h: 1.0 x: 4.981 y: 0.0
w: 1.0 h: 1.0 x: -4.981 y: -41.843
w: 1.0 h: 1.0 x: 4.981 y: 0.0
w: 1.0 h: 1.0 x: -4.981 y: -51.806
w: 1.0 h: 1.0 x: 4.981 y: 0.0
w: 1.0 h: 1.0 x: 175.592 y: -19.925
w: 1.0 h: 1.0 x: 4.981 y: 0.0
w: 1.0 h: 1.0 x: -185.554 y: -19.925
w: 1.0 h: 1.0 x: 4.981 y: 0.0
w: 1.0 h: 1.0 x: 0.0 y: -37.121
w: 1.0 h: 1.0 x: 0.0 y: -13.682
w: 1.0 h: 1.0 x: 3.387 y: 2.989
w: 1.0 h: 1.0 x: 20.175 y: -2.989
w: 1.0 h: 1.0 x: -23.562 y: -0.398
w: 1.0 h: 1.0 x: 30.685 y: 3.387
w: 1.0 h: 1.0 x: 282.916 y: -18.389
w: 1.0 h: 1.0 x: 4.981 y: 0.0
w: 1.0 h: 1.0 x: -318.582 y: -17.196
w: 1.0 h: 1.0 x: 0.0 y: -13.682
w: 1.0 h: 1.0 x: 3.387 y: 2.989
w: 1.0 h: 1.0 x: 20.175 y: -2.989
w: 1.0 h: 1.0 x: -23.562 y: -0.398
w: 1.0 h: 1.0 x: 30.685 y: 3.387
w: 1.0 h: 1.0 x: 11.988 y: -11.216
w: 1.0 h: 1.0 x: 0.0 y: -14.833
w: 1.0 h: 1.0 x: 3.388 y: 4.926
w: 1.0 h: 1.0 x: 60.357 y: -4.926
w: 1.0 h: 1.0 x: -63.745 y: -0.399
w: 1.0 h: 1.0 x: 63.944 y: -3.985
w: 1.0 h: 1.0 x: -59.959 y: 0.0
w: 1.0 h: 1.0 x: 64.143 y: 0.0
w: 1.0 h: 1.0 x: -110.801 y: -13.101
w: 1.0 h: 1.0 x: 0.0 y: -2.241
w: 1.0 h: 1.0 x: 39.308 y: 2.241
w: 1.0 h: 1.0 x: 0.0 y: -2.241
w: 1.0 h: 1.0 x: -37.066 y: 0.0
w: 1.0 h: 1.0 x: 0.0 y: 13.294
w: 1.0 h: 1.0 x: 1.145 y: -9.907
w: 1.0 h: 1.0 x: 39.641 y: 11.302
w: 1.0 h: 1.0 x: 0.0 y: -15.686
w: 1.0 h: 1.0 x: 1.693 y: 14.291
w: 1.0 h: 1.0 x: 0.0 y: -12.896
w: 1.0 h: 1.0 x: 3.288 y: 2.989
w: 1.0 h: 1.0 x: 47.544 y: -2.989
w: 1.0 h: 1.0 x: -50.832 y: -0.299
w: 1.0 h: 1.0 x: 52.227 y: -1.096
w: 1.0 h: 1.0 x: -53.92 y: -0.597
w: 1.0 h: 1.0 x: 57.838 y: 14.888
w: 1.0 h: 1.0 x: 0.0 y: -11.22
w: 1.0 h: 1.0 x: 0.0 y: -2.473
w: 1.0 h: 1.0 x: 42.751 y: 2.473
w: 1.0 h: 1.0 x: 0.0 y: -2.473
w: 1.0 h: 1.0 x: -40.278 y: 0.0
w: 1.0 h: 1.0 x: 0.0 y: 13.693
w: 1.0 h: 1.0 x: 1.313 y: -9.907
w: 1.0 h: 1.0 x: -104.652 y: -78.762
w: 1.0 h: 1.0 x: 166.874 y: 0.0
w: 1.0 h: 1.0 x: 176.837 y: 0.0
pdf to image asked Mar 11 '23 13:03



1 Answer

The cause of the problems

Your code does not really look for image positions and sizes; it merely happens to find them under friendly circumstances.

Your code only shows a single method without explicit context (which, I presume, is the reason why no one seriously analyzed that code and spotted the issue).

Considering the context (PDFBox, content stream analysis), though, I assume that you created an operator processor class in which you overrode the processOperator method according to the posted code. Furthermore, I assume you registered your operator processor for the cm instruction with some PDF stream engine and ran that against your sample PDFs.

Given these assumptions it is pretty clear why the output from your operator processor only sometimes contains image size and position but often many unrelated data sets:

The effect of the instruction cm is merely to change the current transformation matrix; it is not immediately or singularly related to drawing bitmap images!

Confer the PDF specification:

Operands:    a b c d e f
Operator:    cm
Description: Modify the current transformation matrix (CTM) by concatenating the specified matrix (see 8.3.2, "Coordinate Spaces"). Although the operands specify a matrix, they shall be written as six separate numbers, not as an array.

(Table 57 – Graphics State Operators – ISO 32000-1)

The only reason why the cm parameters every once in a while do contain image size and position information is that the bitmap drawing operators draw images into a 1x1 area (in user space units) whose lower left corner is the origin. To stretch and move the coordinate system so that this area eventually corresponds to the desired image size and position on the result page, PDF processors modify the current transformation matrix accordingly using the cm instruction before drawing the image, often right before.

If they do so in a single step (as quoted above, cm concatenates the specified matrix to the CTM; it does not replace it) and don't use rotations or similar niceties, then a and d (the first and the fourth cm parameters) indeed contain the width and height of the image on the page (in default user space units), and e and f (the fifth and the sixth cm parameters) contain the coordinates of its lower left corner.
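To make the "friendly circumstances" concrete, here is a minimal plain-Java sketch (no PDFBox involved) that applies a matrix a b c d e f to the four corners of the 1x1 image square and reports the resulting bounding box. The class and method names are made up for illustration, and the b and c values for the question's first PDF are assumed to be 0, since the posted code never printed them.

```java
// Illustration only: where a matrix [a b c d e f] puts the 1x1 image square.
// UnitSquareMapping and its methods are hypothetical names, not PDFBox API.
final class UnitSquareMapping {

    /** Applies the PDF matrix [a b c d e f] to (x, y): x' = a*x + c*y + e, y' = b*x + d*y + f. */
    static double[] transform(double[] m, double x, double y) {
        return new double[] {
            m[0] * x + m[2] * y + m[4],
            m[1] * x + m[3] * y + m[5]
        };
    }

    /** Returns {minX, minY, width, height} of the transformed unit square. */
    static double[] boundingBox(double[] m) {
        double minX = Double.POSITIVE_INFINITY, minY = Double.POSITIVE_INFINITY;
        double maxX = Double.NEGATIVE_INFINITY, maxY = Double.NEGATIVE_INFINITY;
        double[][] unitCorners = {{0, 0}, {1, 0}, {0, 1}, {1, 1}};
        for (double[] c : unitCorners) {
            double[] p = transform(m, c[0], c[1]);
            minX = Math.min(minX, p[0]);
            maxX = Math.max(maxX, p[0]);
            minY = Math.min(minY, p[1]);
            maxY = Math.max(maxY, p[1]);
        }
        return new double[] {minX, minY, maxX - minX, maxY - minY};
    }

    public static void main(String[] args) {
        // Values from the question's output; b = c = 0 is an assumption.
        double[] friendly = {503.87997, 0, 0, 152.64, 71.5168, 561.056};
        // A 90 degree rotation: a and d are both 0 although the image is 1x1.
        double[] rotated = {0, 1, -1, 0, 0, 0};
        for (double[] m : new double[][] {friendly, rotated}) {
            double[] box = boundingBox(m);
            System.out.println("x: " + box[0] + " y: " + box[1]
                    + " w: " + box[2] + " h: " + box[3]);
        }
    }
}
```

For the unrotated matrix the box is exactly what reading a, d, e, f directly would give. For the rotated matrix, a and d are 0 even though the square still covers a 1x1 area, which is why reading the parameters directly breaks down as soon as rotation is involved.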

How to do it correctly

Thus, instead of merely looking at the cm parameters, one has to

  • parse the content stream in question,
  • calculate the concatenation of all matrices applied to the CTM (also keeping track of the effects of intermediary q and Q instructions), and
  • retrieve the values of the current transformation matrix when the Do instruction for a bitmap image resource occurs.
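The three steps above can be sketched in plain Java, independent of PDFBox. This is a simplified model under stated assumptions: it tracks only the CTM (a real graphics state holds much more), it ignores the extra Matrix that form XObjects can apply, and the class and method names are hypothetical.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Simplified model of CTM bookkeeping for q, Q, cm and Do.
// CtmTracker is a hypothetical name; it is not part of PDFBox.
final class CtmTracker {
    // Matrices are stored as {a, b, c, d, e, f}; start with the identity.
    private double[] ctm = {1, 0, 0, 1, 0, 0};
    private final Deque<double[]> saved = new ArrayDeque<>();

    /** Returns m x n for PDF matrices (row-vector convention). */
    static double[] multiply(double[] m, double[] n) {
        return new double[] {
            m[0] * n[0] + m[1] * n[2],
            m[0] * n[1] + m[1] * n[3],
            m[2] * n[0] + m[3] * n[2],
            m[2] * n[1] + m[3] * n[3],
            m[4] * n[0] + m[5] * n[2] + n[4],
            m[4] * n[1] + m[5] * n[3] + n[5]
        };
    }

    /** q: push a copy of the current matrix. */
    void saveState() { saved.push(ctm.clone()); }

    /** Q: pop the most recently saved matrix, if any. */
    void restoreState() { if (!saved.isEmpty()) ctm = saved.pop(); }

    /** cm: the new matrix is concatenated in front of the CTM, not substituted for it. */
    void concatenate(double a, double b, double c, double d, double e, double f) {
        ctm = multiply(new double[] {a, b, c, d, e, f}, ctm);
    }

    /** Do: the matrix in effect at this moment places the 1x1 image square. */
    double[] matrixAtDo() { return ctm.clone(); }

    public static void main(String[] args) {
        CtmTracker t = new CtmTracker();
        t.concatenate(1, 0, 0, 1, 100, 200);  // move the origin first
        t.saveState();                        // q
        t.concatenate(50, 0, 0, 40, 10, 20);  // scale and offset for the image
        double[] m = t.matrixAtDo();          // Do
        System.out.println("w: " + m[0] + " h: " + m[3] + " x: " + m[4] + " y: " + m[5]);
        t.restoreState();                     // Q
    }
}
```

In this made-up operator sequence the second cm alone would suggest a position of (10, 20), while the accumulated matrix at Do puts the lower left corner at (110, 220). The many small translations in the "example.pdf" output are this kind of intermediate step.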

Fortunately PDFBox already does all the heavy lifting for you under the hood if you let it, cf. the PrintImageLocations examples at

  • (for PDFBox 1.8.13) https://svn.apache.org/repos/asf/pdfbox/tags/1.8.13/examples/src/main/java/org/apache/pdfbox/examples/util/PrintImageLocations.java
  • (for PDFBox 2.0.3) https://svn.apache.org/repos/asf/pdfbox/tags/2.0.3/examples/src/main/java/org/apache/pdfbox/examples/util/PrintImageLocations.java

Concerning your questions

The coordinates you got for "personal_photo_enhancement.pdf" page 2 were correct as far as the PDF coordinate system is concerned. Probably Photoshop uses a different coordinate system or you inspected the wrong image corner.
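If the mismatch comes from the coordinate system, a conversion along the following lines may help when comparing against a raster editor. This is only a sketch: it assumes the page was rasterized with a top-left origin and y growing downward, that the page is unrotated with its media box starting at (0, 0), and that the page height is 792 pt (US Letter). That page height is an assumption, not a value read from the actual PDF, and the names are hypothetical.

```java
// Sketch: convert a PDF-space box (origin bottom-left, y up, units of 1/72 inch)
// into a top-left-origin pixel box. PdfToRaster is a hypothetical helper name.
final class PdfToRaster {

    /** Returns {left, top, width, height} in pixels for the given render resolution. */
    static double[] toTopLeftPixels(double x, double y, double w, double h,
                                    double pageHeightPt, double dpi) {
        double scale = dpi / 72.0;             // PDF user space unit = 1/72 inch
        double top = pageHeightPt - (y + h);   // distance of the upper edge from the page top
        return new double[] {x * scale, top * scale, w * scale, h * scale};
    }

    public static void main(String[] args) {
        // Image box from the question; 792 pt page height is an assumption.
        double[] px = toTopLeftPixels(71.5168, 561.056, 503.87997, 152.64, 792.0, 72.0);
        System.out.println("left: " + px[0] + " top: " + px[1]
                + " width: " + px[2] + " height: " + px[3]);
    }
}
```

Under these assumptions the PDF y value of 561.056 is the image's bottom edge measured from the bottom of the page, so the top edge in a top-left system at 72 dpi lands at about 78.3, not 561.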

You got very many outputs for "example.pdf" page 17 because that PDF uses CTM manipulations not only for sizing and positioning images but for other effects, too, mostly for translating the coordinate system origin. Furthermore, the image on that page is not a bitmap. Thus, it does not have a simple position and size...

mkl answered Mar 19 '23 20:03
