
Split A4 page into A7 sections [closed]

Tags: java, pdf, pdfbox

I'm given a document of A4 pages with 8 A7 sections on each page. I need to extract the data from each A7 area of each page because they're related.

Is it possible to split each A4 page into 8 A7 pages and go through the data in each one?

This is the PDF file I'm dealing with: https://s3.us-east-2.amazonaws.com/s3.barcodegen-website.io/programada+pdf+teste.pdf

(Regarding A4/A7 paper sizes, see ISO 216 at Wikipedia.)
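
As a quick sanity check on the sizes, the sketch below divides a landscape A4 page into a 4 x 2 grid and reports the cell size. The landscape orientation (297 x 210 mm) is an assumption about the linked file, and the class and method names are made up for illustration.

```java
import java.util.Locale;

final class A7Cells {
    // One PDF point is 1/72 inch; one inch is 25.4 mm.
    static final double POINTS_PER_MM = 72.0 / 25.4;

    static double mmToPoints(double mm) {
        return mm * POINTS_PER_MM;
    }

    /** Returns {width, height} in mm of one cell of a cols x rows grid. */
    static double[] cellSizeMm(double pageWidthMm, double pageHeightMm, int cols, int rows) {
        return new double[] { pageWidthMm / cols, pageHeightMm / rows };
    }

    public static void main(String[] args) {
        // Assumption: the A4 page is landscape, 297 x 210 mm.
        double[] cell = cellSizeMm(297, 210, 4, 2);
        System.out.printf(Locale.ROOT, "cell: %.2f x %.2f mm%n", cell[0], cell[1]);
        System.out.printf(Locale.ROOT, "cell: %.2f x %.2f pt%n",
                mmToPoints(cell[0]), mmToPoints(cell[1]));
    }
}
```

Each cell comes out at 74.25 x 105 mm, within a quarter of a millimetre of the nominal 74 x 105 mm A7 size, so a 4 x 2 grid on a landscape A4 page is consistent with the question.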

Alan Rodrigues asked Oct 16 '25 19:10



2 Answers

Splitting PDF pages raises a number of secondary issues, such as what to do with a glyph (or a hyperlink) that straddles a cut line. Internal hyperlinks will usually be discarded, but external ones may need to be kept.

We also need to check for duplicated resources after splitting. Here the source A4 file of 526 KB (539,607 bytes) became 537 KB (550,093 bytes). A split file is sometimes, oddly, smaller than its source, but here it is only slightly larger.


Using an image approach is not acceptable, because at this scale the barcodes are likely to be destroyed.

Image on the left (note the bad infill); vector on the right, which stays accurate for scanning.


Cropped duplication is not always a good solution, as content can overlap between the sections of a page. In this case, however, the page can be cut cleanly into 4 x 2 pages, seen here in facing pairs. At that stage we can also see that the offsets vary and are not perfect for such splitting, so either the source positions need altering or the page boundaries need sliding in different directions.


Corrected result, as seen in Acrobat Reader etc.:
mutool poster -x 4 -y 2 -r programada.pdf output.pdf


The nearest to the desired cropping is:

cpdf -shift-boxes "-20 0" TOTVS.pdf -o tempout1.pdf
cpdf -chop "4 2" tempout1.pdf -o tempout2.pdf
mutool trim -b MediaBox -o final.pdf tempout2.pdf

or

cpdf -shift-boxes "-20 0" TOTVS.pdf -o tempout1.pdf
mutool poster -x 4 -y 2 -r tempout1.pdf tempout2.pdf
mutool trim -b MediaBox -o final.pdf tempout2.pdf

Both variants should produce similar, cleaner A7-size pages.

K J answered Oct 18 '25 08:10



Because I don't know what you really mean by "extract the data from each A7 area of each page because they're related", I'll start by proposing a solution that (at least) visually splits each A4 PDF page into eight A7 pages, in this order:

+---+---+---+---+
| 1 | 2 | 3 | 4 |
|   |   |   |   |
+---+---+---+---+
| 5 | 6 | 7 | 8 |
|   |   |   |   |
+---+---+---+---+

Basically, it crops the current page using a moving window and imports it into a page of the new document.

import java.io.File;
import java.io.IOException;
import org.apache.pdfbox.Loader;
import org.apache.pdfbox.io.RandomAccessReadBufferedFile;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDPage;
import org.apache.pdfbox.pdmodel.common.PDRectangle;

// Uses the PDFBox 3.x API (Loader, RandomAccessReadBufferedFile).
public void splitPdf(String pdfFileName) throws IOException {
  File pdfFile = new File(pdfFileName);
  File pdfTargetFile = new File(pdfFileName + ".a7.pdf");

  try (PDDocument pdfDocument = Loader.loadPDF(new RandomAccessReadBufferedFile(pdfFile));
      PDDocument pdfTargetDocument = new PDDocument()) {

    for (PDPage pdfPage : pdfDocument.getPages()) {
      // Read the full page area once, before the crop box is moved around.
      PDRectangle pageBox = pdfPage.getCropBox();
      float left = pageBox.getLowerLeftX();
      float bottom = pageBox.getLowerLeftY();
      float cellWidth = pageBox.getWidth() / 4;
      float cellHeight = pageBox.getHeight() / 2;
      // PDF y grows upwards, so row 1 is the top row. Visiting it first
      // yields the 1..8 order of the diagram above.
      for (int row = 1; row >= 0; --row) {
        for (int col = 0; col < 4; ++col) {
          PDRectangle window = new PDRectangle(
              left + cellWidth * col, bottom + cellHeight * row, cellWidth, cellHeight);
          pdfPage.setCropBox(window);
          // Each imported page keeps the crop box that was set at import time.
          pdfTargetDocument.importPage(pdfPage);
        }
      }
    }

    // Save while the source document is still open.
    pdfTargetDocument.save(pdfTargetFile);
  }
}

Note that this is not really a split: each page of the new document still contains all the data of the original page, and the rest simply lies beyond the page borders. That is why I call it a "visual" split.
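
Since every page keeps the full content, reading the data of one A7 area comes down to filtering by region rather than by page. The sketch below only computes the eight regions in reading order, using nothing but the JDK. The class name, the method name and the 842 x 595 pt page size are assumptions for illustration. The rectangles use a top-left origin with y growing downwards, which, as far as I know, is what PDFBox's PDFTextStripperByArea expects, but please verify that against your PDFBox version.

```java
import java.awt.geom.Rectangle2D;
import java.util.ArrayList;
import java.util.List;

final class A7Regions {
    /**
     * Splits a page area into cols x rows cells, listed in reading order
     * (left to right, top row first). Coordinates use a top-left origin
     * with y growing downwards.
     */
    static List<Rectangle2D.Float> regions(float pageWidth, float pageHeight, int cols, int rows) {
        float cellWidth = pageWidth / cols;
        float cellHeight = pageHeight / rows;
        List<Rectangle2D.Float> result = new ArrayList<>();
        for (int row = 0; row < rows; row++) {
            for (int col = 0; col < cols; col++) {
                result.add(new Rectangle2D.Float(
                        col * cellWidth, row * cellHeight, cellWidth, cellHeight));
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Assumption: 842 x 595 pt approximates a landscape A4 page.
        List<Rectangle2D.Float> cells = regions(842f, 595f, 4, 2);
        for (int n = 0; n < cells.size(); n++) {
            System.out.println("A7 #" + (n + 1) + ": " + cells.get(n));
        }
    }
}
```

With PDFBox, each rectangle could then be registered through PDFTextStripperByArea.addRegion(name, rectangle), followed by extractRegions(page) and getTextForRegion(name). This only covers text, so whether it is enough depends on what "data" means in the question.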

Francesco Poli answered Oct 18 '25 08:10
