I'm given a document of A4 pages, each containing 8 A7 sections. I need to extract the data from each A7 area of each page, because the areas are related.
Is it possible to split each A4 page into 8 A7 pages and go through the data?
This is the PDF file I'm dealing with: https://s3.us-east-2.amazonaws.com/s3.barcodegen-website.io/programada+pdf+teste.pdf
(Regarding A4/A7 paper sizes, see ISO 216 at Wikipedia.)
Splitting PDF pages raises a number of secondary issues, such as what to do with half a glyph (or half a hyperlink). Internal hyperlinks will therefore usually be discarded, though external ones may need keeping.
We also need to check for duplicated resources: a source A4 file of 526 KB (539,607 bytes) may come out at a slightly different size, such as 537 KB (550,093 bytes). The result is sometimes, oddly, smaller, but here it is only slightly larger.
An image-based approach is not acceptable, as at this scale the barcodes are likely to be destroyed.
Left: image (note the bad infill). Right: vector, which is accurate for scanning.
Cropped duplication is not always a good solution, as contents can overlap between sections of a page. In this case, however, the page can be decimated into 4 x 2 pages, seen here in facing pairs. At that stage we can also see that the offsets vary and are not ideal for such splitting, so either the source positions need altering or the page boundaries need sliding in different directions.
Corrected Result as seen in Acrobat Reader etc.
mutool poster -x 4 -y 2 -r programada.pdf output.pdf
The nearest to the desired cropping is:
cpdf -shift-boxes "-20 0" TOTVS.pdf -o tempout1.pdf
cpdf -chop "4 2" tempout1.pdf -o tempout2.pdf
mutool trim -b MediaBox -o final.pdf tempout2.pdf
or
cpdf -shift-boxes "-20 0" TOTVS.pdf -o tempout1.pdf
mutool poster -x 4 -y 2 -r tempout1.pdf tempout2.pdf
mutool trim -b MediaBox -o final.pdf tempout2.pdf
These should produce similar, cleaner A7-size pages.
Because I don't know what you really mean by "extract the data from each A7 area of each page because they're related", I'll start by proposing a solution that (at least) visually splits the single A4 PDF page into eight A7 ones, following this layout:
+---+---+---+---+
| 1 | 2 | 3 | 4 |
|   |   |   |   |
+---+---+---+---+
| 5 | 6 | 7 | 8 |
|   |   |   |   |
+---+---+---+---+
Basically, it crops the current page using a moving window and imports it into a page of the new document.
import java.io.File;
import java.io.IOException;
import org.apache.pdfbox.Loader;
import org.apache.pdfbox.io.RandomAccessReadBufferedFile;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDPage;
import org.apache.pdfbox.pdmodel.common.PDRectangle;

// Requires PDFBox 3.x (Loader and RandomAccessReadBufferedFile).
public void splitPdf(String pdfFileName) throws IOException {
    File pdfFile = new File(pdfFileName);
    File pdfTargetFile = new File(pdfFileName + ".a7.pdf");
    try (PDDocument pdfDocument = Loader.loadPDF(new RandomAccessReadBufferedFile(pdfFile));
         PDDocument pdfTargetDocument = new PDDocument()) {
        for (PDPage pdfPage : pdfDocument.getPages()) {
            PDRectangle cropBox = pdfPage.getCropBox();
            // Size of one tile: a quarter of the width, half of the height.
            float tileWidth = cropBox.getWidth() / 4;
            float tileHeight = cropBox.getHeight() / 2;
            // Top row first, then left to right, to match the numbering above.
            for (int row = 1; row >= 0; --row) {
                for (int col = 0; col < 4; ++col) {
                    // Offset from the crop box origin, which is not always (0, 0).
                    float x = cropBox.getLowerLeftX() + tileWidth * col;
                    float y = cropBox.getLowerLeftY() + tileHeight * row;
                    // Move the crop window, then import a copy of the page.
                    pdfPage.setCropBox(new PDRectangle(x, y, tileWidth, tileHeight));
                    pdfTargetDocument.importPage(pdfPage);
                }
            }
        }
        pdfTargetDocument.save(pdfTargetFile);
    }
}
Strictly speaking, this is not really a split: each page of the new document still contains all the data of the original page, with most of it lying beyond the page borders. That is why I call it a "visual" split.
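If it helps to check the moving-window arithmetic on its own, here is a minimal sketch without PDFBox. It only computes the eight crop windows for a 4 x 2 grid, listed top row first and left to right as in the diagram. The class name A7Grid, the method tiles, and the 842 x 595 pt landscape A4 size used in main are my own assumptions for illustration; take the real box from your page's crop box.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class A7Grid {
    /**
     * Splits a box into cols x rows tiles.
     * Each tile is {lowerLeftX, lowerLeftY, upperRightX, upperRightY} in PDF
     * user-space units (origin at the bottom-left, y pointing up).
     * Tiles are returned top row first, left to right.
     */
    static List<float[]> tiles(float llx, float lly, float urx, float ury, int cols, int rows) {
        float tileWidth = (urx - llx) / cols;
        float tileHeight = (ury - lly) / rows;
        List<float[]> result = new ArrayList<>();
        for (int row = rows - 1; row >= 0; --row) {      // highest row index is the top row
            for (int col = 0; col < cols; ++col) {
                float x = llx + col * tileWidth;
                float y = lly + row * tileHeight;
                result.add(new float[] {x, y, x + tileWidth, y + tileHeight});
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Assumed page box: landscape A4 rounded to whole points.
        for (float[] tile : tiles(0f, 0f, 842f, 595f, 4, 2)) {
            System.out.println(Arrays.toString(tile));
        }
    }
}
```

Each printed window is what you would pass to the crop box before importing the page. If the content is not aligned to an exact 4 x 2 grid, you can shift the starting box (the first four arguments) before splitting.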